# WhisperX
* Paper link: https://arxiv.org/pdf/2303.00747.pdf
* GitHub repository link: https://github.com/m-bain/whisperX
* Hugging Face link for pyannote: https://huggingface.co/pyannote/speaker-diarization-3.1

In [45]:
import os
import whisper
import torch
import pandas as pd
from operator import itemgetter

# 10-minute audio

In [28]:
import whisperx
import gc 

device = "cuda" 
audio_file = "1001764369_1.WAV"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type = compute_type, language = "es")

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size = batch_size, language = "es")
print(result["segments"]) # before alignment

# delete model if low on GPU resources (drop the reference first, then reclaim memory)
# del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code = result["language"], device = device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments = False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources (drop the reference first, then reclaim memory)
# del model_a; gc.collect(); torch.cuda.empty_cache()

# 3. Assign speaker labels
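# Read the Hugging Face token from the environment instead of hard-coding it
# (the HF_TOKEN variable name is an assumption; `os` is imported in the first cell)
diarize_model = whisperx.DiarizationPipeline(use_auth_token = os.environ["HF_TOKEN"], device = device)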

# add min/max number of speakers if known
diarize_segments = diarize_model(audio, num_speakers = 2)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs




  torchaudio.set_audio_backend("soundfile")
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\usuario\.cache\torch\whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.2+cu118. Bad things might happen unless you revert torch to 1.x.
[{'text': ' Hola. Sí, buenas tardes. Sí, con Juan Cartagena. Sí. Hola. Hola. ¿Cómo estás? Muy bien. Ay, qué bueno. Me alegro. Juan, vimos en nuestro sistema tu interés por asegurar tu moto con la policía a todo riesgo, con placas RKQ33F. ¿Es correcto? Sí. Listo, Juan. Y cuando ingresaste al sistema, ¿pudiste ver los valores? Hola.', 'start': 10.742, 'end': 38.063}, {'text': ' Cuando ingresaste al sistema, ¿pudiste ver los valores? Sí, más o menos. Listo, Juan. ¿Hay alguno que te llamara la atención? ¿Me puedes repetir los valores, por favor? Sí, claro que sí. Te pregunto, Juan. ¿Tú eres el que aparece en la tarjeta de propiedad? Sí. Listo. En un momento, entonces, para que podamos realizar la cotización. Te voy entonces a confirmar tu fecha de

In [29]:
# Average score for each segment
scores = []
for segment in result["segments"]:
    # Keep only the words that have a valid score
    words_with_scores = [word for word in segment["words"] if "score" in word]
    # Average the scores of the filtered words
    if words_with_scores:
        avg_score = pd.DataFrame(words_with_scores)["score"].mean()
    else:
        avg_score = float("nan")  # Alternative: avg_score = None
    scores.append(avg_score)

scores = pd.Series(scores, name = "avg_score")
scores

0      0.289000
1      0.495000
2      0.347750
3      0.217000
4      0.533000
         ...   
165    0.287000
166    0.325000
167    0.438364
168    0.247500
169    0.380750
Name: avg_score, Length: 170, dtype: float64
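
The per-segment average can also be computed without building a DataFrame for every segment. The following is a minimal sketch of that idea using only the standard library. `segment_avg_score` is a hypothetical helper name, the first sample copies the word scores reported for "Sí, buenas tardes." in this notebook, and the second sample is an invented segment whose only word has no `score` key.

```python
import math
from statistics import fmean

def segment_avg_score(segment):
    """Mean score over the words that carry a "score" key; NaN if none do."""
    word_scores = [word["score"] for word in segment.get("words", []) if "score" in word]
    return fmean(word_scores) if word_scores else math.nan

# Word scores taken from the aligned output of this notebook
sample_segment = {
    "text": "Sí, buenas tardes.",
    "words": [
        {"word": "Sí,", "score": 0.763},
        {"word": "buenas", "score": 0.504},
        {"word": "tardes.", "score": 0.218},
    ],
}
# Hypothetical segment: its single word has no score
no_score_segment = {"text": "33", "words": [{"word": "33"}]}

print(round(segment_avg_score(sample_segment), 4))
print(segment_avg_score(no_score_segment))
```

Returning NaN for segments without scored words keeps the result aligned with the segment list, so it can still be concatenated with the segment DataFrame by position.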

In [30]:
result["segments"]

[{'start': 10.742,
  'end': 10.922,
  'text': ' Hola.',
  'words': [{'word': 'Hola.',
    'start': 10.742,
    'end': 10.922,
    'score': 0.289,
    'speaker': 'SPEAKER_00'}],
  'speaker': 'SPEAKER_00'},
 {'start': 11.042,
  'end': 13.244,
  'text': 'Sí, buenas tardes.',
  'words': [{'word': 'Sí,',
    'start': 11.042,
    'end': 12.103,
    'score': 0.763,
    'speaker': 'SPEAKER_01'},
   {'word': 'buenas',
    'start': 12.123,
    'end': 12.343,
    'score': 0.504,
    'speaker': 'SPEAKER_01'},
   {'word': 'tardes.',
    'start': 13.064,
    'end': 13.244,
    'score': 0.218,
    'speaker': 'SPEAKER_01'}],
  'speaker': 'SPEAKER_01'},
 {'start': 13.264,
  'end': 14.105,
  'text': 'Sí, con Juan Cartagena.',
  'words': [{'word': 'Sí,',
    'start': 13.264,
    'end': 13.304,
    'score': 0.017,
    'speaker': 'SPEAKER_01'},
   {'word': 'con',
    'start': 13.324,
    'end': 13.384,
    'score': 0.172,
    'speaker': 'SPEAKER_01'},
   {'word': 'Juan',
    'start': 13.404,
    'end': 13.

In [31]:
# Diarization DataFrame
diarizacion = pd.DataFrame(result['segments'])

# DataFrame with the text and its average score
transcription_df = pd.concat([diarizacion, scores], axis = 1)
transcription_df

| | start | end | text | words | speaker | avg_score |
|---|---|---|---|---|---|---|
| 0 | 10.742 | 10.922 | Hola. | [{'word': 'Hola.', 'start': 10.742, 'end': 10.... | SPEAKER_00 | 0.289000 |
| 1 | 11.042 | 13.244 | Sí, buenas tardes. | [{'word': 'Sí,', 'start': 11.042, 'end': 12.10... | SPEAKER_01 | 0.495000 |
| 2 | 13.264 | 14.105 | Sí, con Juan Cartagena. | [{'word': 'Sí,', 'start': 13.264, 'end': 13.30... | SPEAKER_01 | 0.347750 |
| 3 | 17.988 | 18.248 | Sí. | [{'word': 'Sí.', 'start': 17.988, 'end': 18.24... | SPEAKER_01 | 0.217000 |
| 4 | 18.388 | 18.608 | Hola. | [{'word': 'Hola.', 'start': 18.388, 'end': 18.... | SPEAKER_01 | 0.533000 |
| ... | ... | ... | ... | ... | ... | ... |
| 165 | 592.873 | 593.073 | Bueno. | [{'word': 'Bueno.', 'start': 592.873, 'end': 5... | SPEAKER_00 | 0.287000 |
| 166 | 593.113 | 594.494 | Listo, hágale fuerza. | [{'word': 'Listo,', 'start': 593.113, 'end': 5... | SPEAKER_00 | 0.325000 |
| 167 | 594.514 | 596.896 | Bueno, Juan, ¿te recuerda que hablaste con Jul... | [{'word': 'Bueno,', 'start': 594.514, 'end': 5... | SPEAKER_01 | 0.438364 |
| 168 | 598.177 | 598.617 | Listo, gracias. | [{'word': 'Listo,', 'start': 598.177, 'end': 5... | SPEAKER_00 | 0.247500 |


In [32]:
# Timestamps
from operator import itemgetter
aux_iter = 0
for segment in result["segments"]:
    if "speaker" in segment:
        aux_info = itemgetter("start", "end", "text", "speaker")(segment)
        print(f"[{aux_info[0]} - {aux_info[1]}]  [{aux_info[3]}] [Score - {round(transcription_df['avg_score'][aux_iter], 4)}]: {aux_info[2]}")
    else:
        aux_info = itemgetter("start", "end", "text")(segment)
        print(f"[{aux_info[0]} - {aux_info[1]}] [Score - {round(transcription_df['avg_score'][aux_iter], 4)}]: {aux_info[2]}") 
    aux_iter += 1
print(f"Average score: {round(transcription_df.avg_score.mean(), 4)}, missing speakers: {transcription_df['speaker'].isna().sum()}, missing scores: {transcription_df['avg_score'].isna().sum()}")

[10.742 - 10.922]  [SPEAKER_00] [Score - 0.289]:  Hola.
[11.042 - 13.244]  [SPEAKER_01] [Score - 0.495]: Sí, buenas tardes.
[13.264 - 14.105]  [SPEAKER_01] [Score - 0.3478]: Sí, con Juan Cartagena.
[17.988 - 18.248]  [SPEAKER_01] [Score - 0.217]: Sí.
[18.388 - 18.608]  [SPEAKER_01] [Score - 0.533]: Hola.
[18.628 - 18.808]  [SPEAKER_01] [Score - 0.582]: Hola.
[18.848 - 19.369]  [SPEAKER_01] [Score - 0.511]: ¿Cómo estás?
[20.329 - 22.191]  [SPEAKER_00] [Score - 0.3275]: Muy bien.
[22.231 - 22.831]  [SPEAKER_01] [Score - 0.539]: Ay, qué bueno.
[22.871 - 23.512]  [SPEAKER_01] [Score - 0.7635]: Me alegro.
[24.012 - 30.357]  [SPEAKER_01] [Score - 0.6444]: Juan, vimos en nuestro sistema tu interés por asegurar tu moto con la policía a todo riesgo, con placas RKQ33F.
[30.537 - 31.898]  [SPEAKER_01] [Score - 0.885]: ¿Es correcto?
[32.599 - 34.02]  [SPEAKER_00] [Score - 0.756]: Sí.
[34.06 - 34.4]  [SPEAKER_01] [Score - 0.582]: Listo, Juan.
[34.44 - 36.502]  [SPEAKER_01] [Score - 0.5882]: Y cuand
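
The printing loop keeps a manual counter to look up each segment's score. As a sketch of an alternative, the segments and scores can be paired with `zip` and formatted by one helper that handles both the with-speaker and without-speaker cases. `format_segment` is a hypothetical name, and the second sample segment has its speaker removed on purpose to exercise the fallback branch (in the notebook output it does have one).

```python
def format_segment(segment, avg_score):
    """Build one transcript line; the speaker tag is left out when the segment has none."""
    speaker = f"  [{segment['speaker']}]" if "speaker" in segment else ""
    return (
        f"[{segment['start']} - {segment['end']}]{speaker} "
        f"[Score - {round(avg_score, 4)}]: {segment['text']}"
    )

# Sample data: values copied from this notebook, speaker dropped from the second segment
segments = [
    {"start": 11.042, "end": 13.244, "text": "Sí, buenas tardes.", "speaker": "SPEAKER_01"},
    {"start": 13.264, "end": 14.105, "text": "Sí, con Juan Cartagena."},
]
avg_scores = [0.495, 0.3478]

lines = [format_segment(seg, score) for seg, score in zip(segments, avg_scores)]
for line in lines:
    print(line)
```

Pairing by position assumes the score list was built in the same order as `result["segments"]`, which is how the earlier cell constructs it.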

# Generalizing the exercise

In [27]:
# List of local variable names
# variables_locales = list(locals().keys())

# Delete local objects except functions and imported libraries
# for nombre in variables_locales:
#    objeto = locals()[nombre]
#    if not callable(objeto) and not hasattr(objeto, '__module__'):
#        del locals()[nombre]

In [26]:
# Save the previous exercise before freeing resources
# List of objects to keep
# objetos_a_conservar = ["transcription_df", "pd", "os", "gc", "whisper", "whisperx", "itemgetter"]

# Build a dictionary of the objects to keep
# objetos_conservados = {nombre: globals()[nombre] for nombre in objetos_a_conservar}

# Clear the global environment
# for nombre in list(globals().keys()):
#    if nombre not in objetos_a_conservar:
#        del globals()[nombre]

# Restore the kept objects into the global environment
# globals().update(objetos_conservados)
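
The commented-out cells clear variables from the notebook namespace. A narrower alternative, sketched here as an assumption and not as what this notebook does, is to delete only the references to large objects and then ask Python and PyTorch to reclaim memory. `free_memory` is a hypothetical helper. The torch import is optional so that the sketch also runs on a machine without PyTorch or without a GPU.

```python
import gc

def free_memory():
    """Run garbage collection, then empty the CUDA cache if PyTorch with CUDA is available.

    Returns True when the CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False

# Usage: delete the reference first, then reclaim memory
big_object = [0] * 1_000_000  # stand-in for a loaded model
del big_object
emptied = free_memory()
print(emptied)
```

The order matters: calling `gc.collect()` while a name still points at the model cannot release it, so the `del` comes first.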

In [38]:
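def transcription(audio_file, token, language = "es", num_speakers = 2, device = "cuda", batch_size = 16, compute_type = "float16"):
    """Arguments
    audio_file: File to transcribe (including the extension).
    token: Hugging Face token.
    language: Language of the audio.
    num_speakers: Number of people speaking in the audio.
    device: Device to run the models on.
    batch_size: Transcription batch size; reduce if low on GPU memory.
    compute_type: Change to "int8" if low on GPU memory (may reduce accuracy).
    """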
    # START OF THE WHISPERX WORKFLOW (FROM GITHUB)
    import gc
    import pandas as pd
    import torch
    import whisperx
    # ------------------------------------------------------------------------------------------
    # SET THROUGH THE FUNCTION ARGUMENTS
    # device = "cuda" 
    # audio_file = audio_file
    # batch_size = 16 # reduce if low on GPU mem
    # compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
    # ------------------------------------------------------------------------------------------

    # 1. Transcribe with original whisper (batched)
    model = whisperx.load_model("large-v2", device, compute_type = compute_type, language = language)

    # save model to local path (optional)
    # model_dir = "/path/"
    # model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size = batch_size, language = language)

    # delete model if low on GPU resources (drop the reference first, then reclaim memory)
    del model; gc.collect(); torch.cuda.empty_cache()

    # 2. Align whisper output
    model_a, metadata = whisperx.load_align_model(language_code = result["language"], device = device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments = False)

    # delete model if low on GPU resources (drop the reference first, then reclaim memory)
    del model_a; gc.collect(); torch.cuda.empty_cache()

    # 3. Assign speaker labels
    # THE AUTHENTICATION TOKEN GENERATED ON HUGGING FACE GOES HERE
    diarize_model = whisperx.DiarizationPipeline(use_auth_token = token, device = device)

    # add min/max number of speakers if known
    diarize_segments = diarize_model(audio, num_speakers = num_speakers)
    # diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

    result = whisperx.assign_word_speakers(diarize_segments, result)
    # END OF THE WHISPERX WORKFLOW
    # -------------------------------------------------------------------------------------------------------
    
    # AVERAGE SCORE FOR EACH SEGMENT
    scores = []
    for segment in result["segments"]:
        # Keep only the words that have a valid score
        words_with_scores = [word for word in segment["words"] if "score" in word]
        # Average the scores of the filtered words
        if words_with_scores:
            avg_score = pd.DataFrame(words_with_scores)["score"].mean()
        else:
            avg_score = float("nan")  # Alternative: avg_score = None
        scores.append(avg_score)

    # Build the Series once, after the loop has collected every score
    scores = pd.Series(scores, name = "avg_score")
    
    # DIARIZATION DATAFRAME
    diarizacion = pd.DataFrame(result['segments'])

    # DATAFRAME WITH DIARIZATION AND AVERAGE SCORE
    transcription_df = pd.concat([diarizacion, scores], axis = 1)
    return [result, transcription_df]

# Other audio files

In [41]:
["1014249230_3.WAV", "1129574339_6.WAV", "14838701_2.WAV"]

['1014249230_3.WAV', '1129574339_6.WAV', '14838701_2.WAV']
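
To run the same workflow over every file in this list, one option is a small loop that collects results per file and keeps going when a file fails. This is a sketch only: `transcribe_many` is a hypothetical helper, and it takes the transcription function as a parameter so that the example can run with a stand-in (`fake_transcription`, also hypothetical) instead of the GPU pipeline.

```python
def transcribe_many(audio_files, transcribe_fn):
    """Apply transcribe_fn to each file; return (results, errors) keyed by file name."""
    results, errors = {}, {}
    for audio_file in audio_files:
        try:
            results[audio_file] = transcribe_fn(audio_file)
        except Exception as exc:  # keep processing the remaining files
            errors[audio_file] = str(exc)
    return results, errors

def fake_transcription(audio_file):
    """Stand-in for the real pipeline: accepts .WAV names and returns an empty result."""
    if not audio_file.upper().endswith(".WAV"):
        raise ValueError(f"unsupported file: {audio_file}")
    return [{"segments": []}, None]

files = ["1014249230_3.WAV", "1129574339_6.WAV", "14838701_2.WAV"]
results, errors = transcribe_many(files, fake_transcription)
print(sorted(results), errors)
```

With the notebook's own function, the call could look like `transcribe_many(files, lambda f: transcription(f, token = token))`, where `token` holds the Hugging Face token.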

In [46]:
# The token is read from the environment rather than hard-coded (HF_TOKEN is an assumed variable name)
result_com = transcription("14838701_2.WAV", token = os.environ["HF_TOKEN"], batch_size = 16)

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.2. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\usuario\.cache\torch\whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.2+cu118. Bad things might happen unless you revert torch to 1.x.
