# 0.0.0 WhisperX/Pyannote Transcription+Diarization Pipeline 

This Jupyter notebook is designed to test and evaluate a new Transcription and Diarization Pipeline with the following objectives:
1. Achieving word-level transcription accuracy to ensure detailed and precise text representation of the audio input.
2. Assessing diarization confidence levels to accurately attribute spoken segments to different speakers and measure the reliability of speaker identification.
3. Enhancing the alignment of transcriptions to be closer to natural sentence segments, thereby improving the readability and usability of the transcribed data.

The notebook leverages advanced transcription and diarization capabilities provided by the Whisper, WhisperX, and pyannote libraries. By using GPU acceleration, it processes audio data efficiently, performing alignment and diarization to produce structured outputs that are saved in CSV format for further analysis. The resources and installation instructions are included to facilitate the setup and execution of the pipeline.

Resources:
https://towardsdatascience.com/unlock-the-power-of-audio-data-advanced-transcription-and-diarization-with-whisper-whisperx-and-ed9424307281 

Ignore this section for now. It records a test of a new environment, whisperx-env2. The existing speaker-env environment on the 3090 works and only needed the json, webvtt, and srt libraries.

conda create -n whisperx-env2 python=3.8
conda activate whisperx-env2
conda install pandas
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 -c pytorch

pip install git+https://github.com/m-bain/whisperx.git

pip install --upgrade charset-normalizer

pip install --upgrade nltk
pip install --upgrade numpy
pip install --upgrade plotly
pip install --upgrade matplotlib
pip install --upgrade jupyter ipywidgets

pip install webvtt-py
pip install pypi-json
pip install srt
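
After running the install commands above, it can help to confirm which modules the active environment can actually import before starting a long transcription run. The following is a minimal sketch, not part of the original notebook. The module list is an assumption based on the packages named above (for example, the webvtt-py package is imported as `webvtt`), and `find_missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Import names assumed from the pip/conda packages listed above
# (for example, the webvtt-py package is imported as "webvtt").
REQUIRED_MODULES = ["torch", "whisperx", "pandas", "webvtt", "srt", "json"]


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this inside the notebook kernel, rather than in a separate terminal, checks the environment the notebook is really using.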


# 0.1.0 Install libraries into a virtual environment
Ignore if already installed in a conda or other virtual environment

In [None]:
# wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/windows-x86_64/cudnn-windows-x86_64-9.2.0.82_cuda11-archive.zip
# https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=11&target_type=exe_local
# %pip install ctranslate2
# !python -m ipykernel install --user --name=cuda --display-name "cuda-gpt"
# %pip install ipykernel jupyter
# Make sure the notebook is running in the correct virtual environment, only needs to be run once.
# libraries, packages, etc. to install in the notebook environment
#%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
#%pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
#%pip install pandas
#%pip install numpy
#%pip install gc-python-utils
#%pip install --q git+https://github.com/m-bain/whisperx.git
#%pip install --upgrade jupyter ipywidgets
#%pip install tqdm
#%pip install jupyterlab

# Run this install cell only once

In [1]:
#%pip install webvtt-py
#%pip install pypi-json
#%pip install srt
%pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Note: you may need to restart the kernel to use updated packages.


# 1.0.0 Check if CUDA GPU is available

In [1]:
# Check if GPU is available
import torch                                                # PyTorch
print(torch.__version__)
if torch.cuda.is_available():
    torch.cuda.set_device(0)                                # Set the main GPU as the device to use if present
torch.cuda.is_available(), (torch.cuda.get_device_name() if torch.cuda.is_available() else None)  # GPU availability and GPU name (None without a GPU)


2.3.0+cu121


(True, 'NVIDIA GeForce RTX 4060 Laptop GPU')
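
The check above assumes a CUDA GPU at index 0. On a machine without one, `torch.cuda.set_device(0)` raises an error instead of falling back to the CPU. A minimal sketch of a more defensive selection follows. `pick_device` is a hypothetical helper, not part of the notebook or of PyTorch, and it only uses `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.set_device`.

```python
def pick_device(preferred_index=0):
    """Return ("cuda", index) when that CUDA GPU is usable, else ("cpu", None)."""
    try:
        import torch
    except ImportError:
        return "cpu", None                      # PyTorch itself is not installed
    if torch.cuda.is_available() and preferred_index < torch.cuda.device_count():
        torch.cuda.set_device(preferred_index)  # make this GPU the default device
        return "cuda", preferred_index
    return "cpu", None                          # no GPU, or index out of range


device, gpu_index = pick_device()
print(f"Using device: {device}")
```

Falling back to `"cpu"` keeps the notebook runnable on a laptop without a GPU, at the cost of much slower transcription.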

# 1.1.0 Start here by adjusting variables

In [4]:
# 1. Set the device and other configuration variables
# Import necessary libraries
import os                                                   # OS
from dotenv import load_dotenv
import pandas as pd                                         # Pandas
import torch                                                # PyTorch
import whisperx                                             # Import the whisperx library    
import gc                                                   # for garbage collection   
import datetime                                             # for timing the process
#from whisperx.utils import get_writer                      # Import the get_writer function from the whisperx library to write the transcripts to a file
import json                                                 # Import the json library to convert the JSON string to a JSON object
import webvtt                                               # Import the webvtt library to convert the VTT file to a JSON string
import srt                                                  # Import the srt library to convert the SRT file to a JSON string

# Load the environment variables from the .env file
load_dotenv()
HF_TOKEN = os.getenv('HF_TOKEN')

# Check if GPU is available
if torch.cuda.is_available():
    torch.cuda.set_device(0)                                # Change to 0, 1, 2, ... depending on which GPU you want to use

device = "cuda" if torch.cuda.is_available() else "cpu"     # Set the device

batch_size = 32                                             # reduce (for example to 4) if low on GPU memory
compute_type = "float32"                                    # change to "float16" or "int8" if low on GPU memory (may reduce accuracy)
hf_token = HF_TOKEN                                         # Set your Hugging Face API token as HF_TOKEN in the .env file
whisperx_model = "small.en"                                 # change to "large-v2" for a larger model; "medium.en" is another option (the large models have no ".en" variant)

# Paths
base_dir = 'Data/RawAudioFiles_Inputs'                      # Replace with the path to your main folder containing subfolders with audio files
output_base_dir = 'Data/Trancripts_Outputs'                 # Replace with the path to the folder where you want to save the transcripts
file_type1 = '.ogg'                                         # First audio extension to search for
file_type2 = '.mp3'                                         # Second audio extension to search for
file_type3 = '.WAV'                                         # Third audio extension to search for; matching is case-sensitive, so '.WAV' and '.wav' differ

# Load pseudonyms CSV for anonymizing the transcripts
pseudonyms_df = pd.read_csv('data/pseudonyms.csv')           # Load the pseudonyms CSV file. in the format name,pseudonym as column headers
pseudonym_dict = dict(zip(pseudonyms_df['name'], pseudonyms_df['pseudonym']))  # Create a pseudonym dictionary from the CSV file, only stored in ram


torch.cuda.is_available(), (torch.cuda.get_device_name() if torch.cuda.is_available() else None)  # Last-second debug check: GPU availability and card name (None without a GPU)

(True, 'NVIDIA GeForce RTX 4060 Laptop GPU')
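
The `pseudonym_dict` loaded above is applied later in this notebook with plain `str.replace`, which also rewrites a name when it appears inside a longer word. A minimal sketch of a whole-word alternative follows, assuming names are ordinary words or phrases. `anonymize_whole_words` and `build_pseudonym_pattern` are hypothetical helpers, and the names in `example_dict` are invented for illustration.

```python
import re


def build_pseudonym_pattern(pseudonym_dict):
    """Compile one regex that matches any real name as a whole word."""
    # Longest names first so "Ann Lee" is tried before "Ann".
    names = sorted(pseudonym_dict, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")


def anonymize_whole_words(text, pseudonym_dict):
    """Replace whole-word occurrences of each real name with its pseudonym."""
    if not pseudonym_dict:
        return text                              # nothing to replace
    pattern = build_pseudonym_pattern(pseudonym_dict)
    # A single pass means inserted pseudonyms are never re-scanned.
    return pattern.sub(lambda m: pseudonym_dict[m.group(0)], text)


# Hypothetical names, for illustration only
example_dict = {"Ann": "P1", "Ann Lee": "P2"}
print(anonymize_whole_words("Ann Lee sent the Annual report to Ann.", example_dict))
# prints: P2 sent the Annual report to P1.
```

Matching remains case-sensitive, as in the original function, so a name transcribed in a different case would still pass through unchanged.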

# 1.2.0 Run after adjusting variables first

Run this cell as-is. You should not need to change anything unless you want to write fewer or more output file types. The cell mostly defines functions, which are then called at the end of the cell.
You should see output similar to the following:

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0. Bad things might happen unless you revert torch to 1.x.
Detected language: en (1.00) in first 30s of audio...
[{'text': ' LeGel. Glad to see things are going well and business is starting to pick up. Andrea told me about your outstanding numbers on Tuesday. Keep up the good work. Now to other business. ', 'start': 1.613, 'end': 29.889}, {'text': ' for the outstanding monies that is due. One, can you pay the balance of 'works', 'start': 115.776, 'end': 116.016, 'score': 0.681}, {'word': 'for', 'start': 116.056, 'end': 116.196, 'score': 0.893}, {'word': 'you', 'start': 116.256, 'end': 116.356, 'score': 0.997}]}, {'start': 119.974, 'end': 120.278, 'text': ' Thanks.', 'words': [{'word': 'Thanks.', 'start': 119.974, 'end': 120.278, 'score': 0.766}]}]
                              segment label     speaker       start  \
0   [ 00:00:01.621 -->  00:00:02.487]     A  SPEAKER_00    1.621392   
1   [ 00:00:06.001 -->  00:00:08.497]     B  SPEAKER_00    6.001698   

           end  intersection       union  
0     2.487267   -117.486733  118.656608  
1     8.497453   -111.476547  114.276302  

Data has been written to Data/Trancripts_Outputs\EMPOWER\2\Monologue2 in CSV, TXT, JSON, SRT, and VTT formats
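
The aligned segments in the sample output above carry a `score` for each word. One way to summarize them per segment is a simple mean. This is a minimal sketch, assuming the segment layout shown above (`words` as a list of dicts that may or may not include `score`). `mean_word_score` is a hypothetical helper, and the score measures word alignment, not diarization confidence.

```python
def mean_word_score(segment):
    """Average the 'score' of the words in one aligned segment.

    Returns None when the segment has no scored words.
    """
    scores = [w["score"] for w in segment.get("words", []) if "score" in w]
    return sum(scores) / len(scores) if scores else None


# Segment copied from the sample output above
sample_segment = {
    "start": 119.974, "end": 120.278, "text": " Thanks.",
    "words": [{"word": "Thanks.", "start": 119.974, "end": 120.278, "score": 0.766}],
}
print(mean_word_score(sample_segment))
```

Segments with a low mean could then be flagged for manual review in the CSV output.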

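
The VTT export in the cell below builds its timestamps with `str(datetime.timedelta(...))`. That produces `'0:00:01.613000'` for 1.613 seconds and `'0:02:00'`, with no millisecond field, for exactly 120 seconds. Whether webvtt-py accepts those strings may depend on the installed version. A minimal sketch of a formatter that always emits `HH:MM:SS.mmm` follows. `format_vtt_timestamp` is a hypothetical helper, not part of webvtt-py.

```python
def format_vtt_timestamp(seconds):
    """Format a non-negative number of seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)                 # work in whole milliseconds
    hours, remainder = divmod(total_ms, 3_600_000)   # 3,600,000 ms per hour
    minutes, remainder = divmod(remainder, 60_000)   # 60,000 ms per minute
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


print(format_vtt_timestamp(1.613))     # 00:00:01.613
print(format_vtt_timestamp(120.0))     # 00:02:00.000
print(format_vtt_timestamp(3725.5))    # 01:02:05.500
```

Working in integer milliseconds avoids a seconds field that rounds up to `60.000` near a minute boundary.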

In [6]:
# 2. Functions for transcribing and diarization of audio files

# A. Function to find audio files in subfolders
def find_audio_files(base_dir, extensions): 
    audio_files = []
    for root, _, files in os.walk(base_dir):
        for audio_file in files:
            if any(audio_file.endswith(ext) for ext in extensions):
                full_audio_path = os.path.join(root, audio_file)
                audio_files.append(full_audio_path)
                print(f"Found audio file: {full_audio_path}")
    return audio_files

# B. Function to get file modification date
def get_file_modification_date(file_path):
    timestamp = os.path.getmtime(file_path)
    date = datetime.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
    return date

# Function to anonymize text
def anonymize_text(text, pseudonym_dict):
    for real_name, pseudonym in pseudonym_dict.items():
        text = text.replace(real_name, pseudonym)
    return text

# Function to convert segments to different formats and save
def save_transcripts(segments, output_dir, filename, date_str):
    # Anonymize segments
    for segment in segments:
        segment['text'] = anonymize_text(segment['text'], pseudonym_dict)
    
    # Add sentence numbers
    for i, segment in enumerate(segments):
        segment['sentence_number'] = i + 1
    
    # Convert segments to DataFrame and reorder columns
    df = pd.DataFrame(segments)

    # Clean leading spaces from the 'text' column
    df['text'] = df['text'].apply(lambda x: x.lstrip())

    # Reorder columns, ensuring 'sentence_number' is first
    cols = df.columns.tolist()
    cols = ['sentence_number'] + [col for col in cols if col != 'sentence_number']
    df = df[cols]
    
    # Save as CSV
    csv_output_file = os.path.join(output_dir, f'{filename}_{date_str}_transcription.csv')
    df.to_csv(csv_output_file, index=False)
    
    # Save as TXT
    txt_output_file = os.path.join(output_dir, f'{filename}_{date_str}_transcription.txt')
    with open(txt_output_file, 'w', encoding='utf-8') as f:
        for segment in segments:
            # Strip leading and trailing whitespace from the text
            clean_text = segment['text'].strip()
            f.write(f"{clean_text}\n")


    
    # Save as JSON with sentence number first
    json_output_file = os.path.join(output_dir, f'{filename}_{date_str}_transcription.json')
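    # Reorder keys to match the CSV columns and clean the text field; keys missing from a segment (for example 'speaker') are skipped
    segments_reordered = [{k: segment[k].lstrip() if k == 'text' else segment[k] for k in cols if k in segment} for segment in segments]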
    with open(json_output_file, 'w', encoding='utf-8') as f:
        json.dump(segments_reordered, f, ensure_ascii=False, indent=4)

    
    # Save as VTT
    vtt_output_file = os.path.join(output_dir, f'{filename}_{date_str}_transcription.vtt')
    vtt = webvtt.WebVTT()
    for segment in segments:
        vtt_segment = webvtt.Caption()
        vtt_segment.start = str(datetime.timedelta(seconds=segment['start']))
        vtt_segment.end = str(datetime.timedelta(seconds=segment['end']))
        # Strip surrounding whitespace from the text and prefix it with the sentence number
        clean_text = segment['text'].strip()
        vtt_segment.lines = [f"{segment['sentence_number']}: {clean_text}"]
        vtt.captions.append(vtt_segment)
    vtt.save(vtt_output_file)


    print(f"Data has been written to {output_dir} in CSV, TXT, JSON, and VTT formats")

# C. Function to process each audio file
def process_audio_file(audio_file, output_dir):
    try:
        print(f"Processing file: {audio_file}")
        # Load audio
        audio = whisperx.load_audio(audio_file)
        
        # Load and transcribe using WhisperX model
        model = whisperx.load_model(whisperx_model, device, compute_type=compute_type)
        result = model.transcribe(audio, batch_size=batch_size)
        print(result["segments"])  # before alignment

        # Clean up model from GPU if needed
        del model
        gc.collect()
        torch.cuda.empty_cache()

        # Align WhisperX output
        model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
        result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
        print(result["segments"])  # after alignment

        # Clean up alignment model from GPU if needed
        del model_a
        gc.collect()
        torch.cuda.empty_cache()

        # Diarization with WhisperX
        diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
        diarize_segments = diarize_model(audio)
        result = whisperx.assign_word_speakers(diarize_segments, result)
        print(diarize_segments)
        print(result["segments"])  # segments are now assigned speaker IDs

        # Get file modification date
        date_str = get_file_modification_date(audio_file)
        
        # Save transcripts in multiple formats
        os.makedirs(output_dir, exist_ok=True)
        filename = os.path.splitext(os.path.basename(audio_file))[0]
        save_transcripts(result["segments"], output_dir, filename, date_str)

    except Exception as e:
        print(f"An error occurred while processing {audio_file}: {e}")

    finally:
        # Release every model and result, even if an earlier step failed before they were created
        model = model_a = diarize_model = result = None     # Rebinding to None is safe where `del` would raise on an undefined name
        gc.collect()                                        # Garbage collection
        torch.cuda.empty_cache()                            # Empty cache

# D. Main function to execute the tasks
def main(base_dir, output_base_dir, extensions):
    audio_files = find_audio_files(base_dir, extensions)
    print(f"Found {len(audio_files)} audio files.")
    for audio_file in audio_files:
        relative_path = os.path.relpath(audio_file, base_dir)
        output_dir = os.path.join(output_base_dir, os.path.splitext(relative_path)[0])  # Create a unique output folder for each audio file
        process_audio_file(audio_file, output_dir)

# Define the file extensions to look for
extensions = [file_type1, file_type2, file_type3]

# E. Execute the main function
main(base_dir, output_base_dir, extensions)



Found audio file: Data/RawAudioFiles_Inputs\projectFiles\1\Monologue.ogg
Found audio file: Data/RawAudioFiles_Inputs\projectFiles\2\Monologue2.ogg
Found 2 audio files.
Processing file: Data/RawAudioFiles_Inputs\projectFiles\1\Monologue.ogg


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\mrhal\.cache\torch\whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.
[{'text': ' the job. Glad to see things are going well and business is starting to pick up. Andrea told me about your outstanding numbers on Tuesday. Keep up the good work. Now to other business. I am gonna suggest a payment schedule', 'start': 1.613, 'end': 29.889}, {'text': ' for the outstanding monies that is due. One, can you pay the balance of the license agreement as soon as possible? Two, I suggest we set up or you suggest which you can pay on the back royalties', 'start': 32.619, 'end': 59.94}, {'text': ' what do you feel comfortable with paying every two weeks every month I would like to keep I would like to catch up', 'start': 62.619, 'end': 88.592}, {'text': ' and maintain current royalties so if we can start current royalties and m