# Generating speech-to-text transcriptions using WhisperX

**Objective**: To generate accurate transcriptions of audio recordings from several podcasts for later analysis using natural language processing (NLP). 

**How to get this notebook to work**:
1. First, follow the setup instructions on the [WhisperX GitHub page](https://github.com/m-bain/whisperX).
2. After you have created a conda environment and installed all the dependencies, register that environment as a Jupyter kernel: run `conda install ipykernel -c conda-forge`, followed by `ipython kernel install --user --name=<envname>`.
3. You should now be able to open Jupyter Notebook and select the environment name as a kernel when you create a new notebook.
4. Finally, update the device and compute type to match your resources. On a personal computer, the `large-v2` model is likely to exhaust memory and kill the kernel. `"cuda"` requires an NVIDIA GPU, so if you don't have one, set the device to `"cpu"`, use a lighter compute type such as `"int8"`, and consider a smaller model.
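
As a rough sketch of step 4, the two settings can be chosen from whether a CUDA GPU is visible. `pick_settings` is a hypothetical helper, not part of WhisperX, and the `"cpu"`/`"int8"` pairing is an assumption based on the CPU option described in the WhisperX README. The check itself uses PyTorch's `torch.cuda.is_available()`.

```python
def pick_settings(has_gpu: bool) -> tuple[str, str]:
    """Return (device, compute_type) to pass to whisperx.load_model.

    Hypothetical helper: float16 on a CUDA GPU, int8 on CPU.
    """
    if has_gpu:
        return "cuda", "float16"
    return "cpu", "int8"


try:
    import torch  # normally installed as a WhisperX dependency

    has_gpu = torch.cuda.is_available()
except ImportError:
    # Assumption: without PyTorch there is no usable GPU backend.
    has_gpu = False

device, compute_type = pick_settings(has_gpu)
print(device, compute_type)
```

The resulting `device` and `compute_type` could then replace the hard-coded values in the model-loading cell.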

In [1]:
import whisperx
import gc
import json

  from .autonotebook import tqdm as notebook_tqdm
torchvision is not available - cannot save figures


In [2]:
device = "cuda"
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

model = whisperx.load_model("large-v2", device, compute_type=compute_type)

No language specified, language will be first be detected for each audio file (increases inference time).


Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.6. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.0. Bad things might happen unless you revert torch to 1.x.


In [4]:
file_path = "/safestore/users/lindsey/Desktop/deathblart-nlp/data/"
file_list = ["2015_trimmed","2016","2017","2018","2019","2020","2021","2022"]

In [None]:
# The 2015 episode opens with about 30 s of music, so trim it before transcribing
from pydub import AudioSegment

podcast = AudioSegment.from_file(file_path + "Blart2015.mp3", format="mp3")

# pydub slices audio in milliseconds
thirty_seconds = 30 * 1000
trimmed_podcast = podcast[thirty_seconds:]

# Export next to the other episodes so the transcription loop can find it
trimmed_podcast.export(file_path + "Blart2015_trimmed.mp3", format="mp3")
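
The trim matters because the language is detected from the first 30 s of audio, and the untrimmed 2015 episode was detected as `pt (0.69)` in the run logged further down. As a sketch, the "Detected language" log lines can be scanned for suspicious results. `flag_detections`, the expected language `"en"`, and the 0.6 confidence threshold are all assumptions made for illustration, not WhisperX features.

```python
import re

# Matches log lines such as "Detected language: pt (0.69) in first 30s of audio..."
DETECTED = re.compile(r"Detected language: (\w+) \(([\d.]+)\)")


def flag_detections(log_lines, expected="en", min_confidence=0.6):
    """Return (language, confidence) pairs that look unreliable."""
    flagged = []
    for line in log_lines:
        match = DETECTED.search(line)
        if match is None:
            continue  # not a detection line
        language, confidence = match.group(1), float(match.group(2))
        if language != expected or confidence < min_confidence:
            flagged.append((language, confidence))
    return flagged


# Three lines copied from the transcription run's output.
log = [
    "Detected language: pt (0.69) in first 30s of audio...",
    "Detected language: en (0.59) in first 30s of audio...",
    "Detected language: en (0.99) in first 30s of audio...",
]
print(flag_detections(log))
```

If every episode is known to be English, passing `language="en"` to `whisperx.load_model` should skip detection altogether; check that your installed version accepts that argument.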

In [5]:
batch_size = 16  # reduce if low on GPU mem

for audio_track in file_list:
    audio_file = file_path + "Blart" + audio_track + ".mp3"
    print(audio_file)

    # Unaligned transcription
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=batch_size)

    # Align the segments to get word-level timestamps
    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

    # Save the aligned segments as JSON in the current working directory
    with open("Blart" + audio_track + ".json", "w") as write_file:
        json.dump(result["segments"], write_file)

    # Free the alignment model before the next episode
    del model_a
    gc.collect()

/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2015.mp3
Detected language: pt (0.69) in first 30s of audio...


Downloading (…)rocessor_config.json: 100%|█████| 262/262 [00:00<00:00, 1.39MB/s]
Downloading (…)lve/main/config.json: 100%|█| 1.78k/1.78k [00:00<00:00, 12.2MB/s]
Downloading (…)olve/main/vocab.json: 100%|█████| 430/430 [00:00<00:00, 2.89MB/s]
Downloading (…)cial_tokens_map.json: 100%|████| 85.0/85.0 [00:00<00:00, 619kB/s]
Downloading pytorch_model.bin: 100%|███████| 1.26G/1.26G [00:22<00:00, 54.9MB/s]


/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2016.mp3
Detected language: en (0.59) in first 30s of audio...
/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2017.mp3
Detected language: en (0.99) in first 30s of audio...
/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2018.mp3
Detected language: en (0.53) in first 30s of audio...
/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2019.mp3
Detected language: en (0.98) in first 30s of audio...
/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2020.mp3
Detected language: en (0.99) in first 30s of audio...
Failed to align segment (" ♪♪"): no characters in this segment found in model dictionary, resorting to original...
/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2021.mp3
Detected language: en (1.00) in first 30s of audio...
/safestore/users/lindsey/Desktop/deathblart-nlp/data/Blart2022.mp3
Detected language: en (1.00) in first 30s of audio...
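
For the later NLP step, each saved file holds the list of aligned segments. Below is a minimal sketch of turning one back into plain text, assuming each segment is a dict with a `"text"` key, as in WhisperX's aligned output. `segments_to_text` and the sample data are hypothetical and stand in for a real `Blart<year>.json` file.

```python
import json
import tempfile
from pathlib import Path


def segments_to_text(segments):
    """Join the stripped text of each segment into one transcript string."""
    return " ".join(seg["text"].strip() for seg in segments if seg.get("text"))


# Made-up sample with the same shape as the saved segment list.
sample = [
    {"start": 0.0, "end": 1.5, "text": " Hello and welcome."},
    {"start": 1.5, "end": 3.0, "text": " This is a test."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"
    path.write_text(json.dumps(sample))
    # Round-trip through disk, mirroring the json.dump call in the notebook.
    segments = json.loads(path.read_text())

transcript = segments_to_text(segments)
print(transcript)
```

Keeping the `start` and `end` fields alongside the text would allow time-based analysis later on.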
