<a href="https://colab.research.google.com/github/manuelarguelles/Speech-to-Text-Transcription-Tool-v2/blob/main/20240624_Speech(LongAudio_Estereo)_to_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This example uses a 3-hour audio recording:

https://www.facebook.com/share/v/N1WrLDh6dr3c9Ppi/?mibextid=oFDknk


The .mp3 file can be downloaded from the following path:


In [1]:
# Tool used to convert .mp3 to .wav
!ffmpeg -version

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-l

In [2]:
# OpenAI package
!pip install openai

!pip install pattern

!pip install -q git+https://github.com/openai/whisper.git > /dev/null
!pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null



In [3]:
import whisper
import torch
import pyannote.audio
import subprocess
import contextlib
import wave
import datetime
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Audio
from pyannote.core import Segment
import gc

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
import os

# List the files in the specified directory
directory_path = '/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/'
files = os.listdir(directory_path)
print(f"Files in directory {directory_path}:")
print(files)

Files in directory /content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/:
['Audio.mp3', 'Audio.wav']


In [5]:
import subprocess

def extract_audio_segment(input_path, start_time, end_time, output_path):
    """
    Extracts a segment from an audio file.

    :param input_path: Path to the input audio file.
    :param start_time: Start time of the segment in seconds.
    :param end_time: End time of the segment in seconds.
    :param output_path: Path to save the output audio segment.
    """
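    command = [
        'ffmpeg', '-y',          # -y: overwrite the output file without prompting
        '-i', input_path,
        '-ss', str(start_time),
        '-to', str(end_time),
        '-c', 'copy',            # copy the stream without re-encoding
        output_path,
    ]
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError as e:
        # subprocess.run without check=True never raises CalledProcessError;
        # the failure that can be raised here is a missing ffmpeg binary.
        print(f"Error during audio extraction: {e}")
        return
    if result.returncode != 0:
        print(f"Error during audio extraction: {result.stderr.decode(errors='replace')}")
    else:
        print(f"Audio segment saved to {output_path}")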


In [6]:
# Customizable parameters, in case only part of the full audio
# should be transcribed.

input_audio_path = '/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio.mp3'
start_time_seconds = 5 * 60  # 5:00
end_time_seconds = 15 * 60 + 9  # 15:09
output_audio_path = '/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio_segment.mp3'

# Extract the audio segment
extract_audio_segment(input_audio_path, start_time_seconds, end_time_seconds, output_audio_path)

Audio segment saved to /content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio_segment.mp3
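
The start and end offsets above are computed by hand in seconds. As a convenience, a small helper can convert timestamps as they appear in a media player. This is only a sketch: the name `to_seconds` is hypothetical and not part of the original notebook.

```python
def to_seconds(timestamp: str) -> int:
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS' into a number of seconds."""
    parts = [int(p) for p in timestamp.split(':')]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported timestamp: {timestamp!r}")
    seconds = 0
    for part in parts:
        # Each new field is 60 times finer than the previous one.
        seconds = seconds * 60 + part
    return seconds

# Same offsets as in the cell above, written as timestamps.
print(to_seconds('5:00'), to_seconds('15:09'))
```

The returned values can be passed directly as `start_time` and `end_time` to `extract_audio_segment`.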


In [7]:
# Class for transcription with speaker labeling

class AudioTranscriber:
    def __init__(self, path, num_speakers=1, language='Spanish', model_size='large'):
        self.path = path
        self.num_speakers = num_speakers
        self.language = language
        self.model_size = model_size
        self.audio = Audio()
        self.embedding_model = PretrainedSpeakerEmbedding(
            "speechbrain/spkrec-ecapa-voxceleb", device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))
        self.model = whisper.load_model(model_size)
        if path[-3:] != 'wav':
            self.path = self.convert_to_wav(path)

    def convert_to_wav(self, path):
        filename = path.rsplit('.', 1)[0]
        new_path = f"{filename}.wav"
        subprocess.call(['ffmpeg', '-i', path, '-ac', '1', new_path, '-y'])  # Convert to mono
        return new_path

    def transcribe(self):
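        # Pass the configured language; without it Whisper auto-detects the language.
        result = self.model.transcribe(self.path, language=self.language)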
        segments = result["segments"]
        with contextlib.closing(wave.open(self.path, 'r')) as f:
            frames = f.getnframes()
            rate = f.getframerate()
            duration = frames / float(rate)
        embeddings = self.get_embeddings(segments, duration)
        clustering = AgglomerativeClustering(self.num_speakers).fit(embeddings)
        labels = clustering.labels_
        for i in range(len(segments)):
            segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)
        return segments

    def get_embeddings(self, segments, duration):
        embeddings = np.zeros(shape=(len(segments), 192))
        for i, segment in enumerate(segments):
            embeddings[i] = self.segment_embedding(segment, duration)
        return np.nan_to_num(embeddings)

    def segment_embedding(self, segment, duration):
        start = segment["start"]
        end = min(duration, segment["end"])
        clip = Segment(start, end)
        waveform, sample_rate = self.audio.crop(self.path, clip)
        return self.embedding_model(waveform[None])

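    def save_transcription(self, segments, output_filename):
        # Write UTF-8 explicitly so accented characters survive on any platform.
        with open(output_filename, "w", encoding="utf-8") as f:
            for (i, segment) in enumerate(segments):
                # Emit a header line whenever the speaker changes.
                if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
                    f.write("\n" + segment["speaker"] + ' ' + str(datetime.timedelta(seconds=round(segment["start"]))) + '\n')
                # Strip the leading space Whisper usually adds, without dropping real characters.
                f.write(segment["text"].lstrip() + ' ')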
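
To make the layout of the saved file concrete, the following sketch reproduces the grouping logic of `save_transcription` on a few synthetic segments. The helper name `format_transcript` and the sample segments are hypothetical and invented for illustration; no model is loaded.

```python
import datetime

def format_transcript(segments):
    """Group consecutive segments by speaker, with a header at each change."""
    pieces = []
    for i, segment in enumerate(segments):
        if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
            stamp = datetime.timedelta(seconds=round(segment["start"]))
            pieces.append(f"\n{segment['speaker']} {stamp}\n")
        pieces.append(segment["text"].strip() + ' ')
    return ''.join(pieces)

# Synthetic segments, shaped like the dictionaries the class produces.
demo_segments = [
    {"start": 0.0, "speaker": "SPEAKER 1", "text": " Good morning."},
    {"start": 4.2, "speaker": "SPEAKER 1", "text": " Please be seated."},
    {"start": 65.7, "speaker": "SPEAKER 2", "text": " Thank you."},
]
demo_text = format_transcript(demo_segments)
print(demo_text)
```

Each speaker change starts a new block headed by the speaker label and the start time rounded to the nearest second.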


In [8]:
# Running on the full audio
transcriber = AudioTranscriber('/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio.mp3')
segments = transcriber.transcribe()
transcriber.save_transcription(segments, '/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio.txt')
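
Inside `transcribe()`, the audio duration is read from the WAV header as frames divided by frame rate, and `segment_embedding()` clamps each segment's end to that value. The sketch below shows the same computation on a synthetic two-second file; the helper name `wav_duration_seconds` and the temporary file are assumptions made for the example.

```python
import contextlib
import os
import tempfile
import wave

def wav_duration_seconds(path):
    """Return the duration of a WAV file as frames / frame rate."""
    with contextlib.closing(wave.open(path, 'rb')) as f:
        return f.getnframes() / float(f.getframerate())

# Write 2 seconds of 16 kHz, 16-bit mono silence to a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.wav')
    with wave.open(demo_path, 'wb') as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b'\x00\x00' * 16000 * 2)
    demo_duration = wav_duration_seconds(demo_path)

# A segment end reported slightly past the end of the file is clamped.
clamped_end = min(demo_duration, 2.37)
print(demo_duration, clamped_end)
```

Clamping matters because a segment end reported past the end of the file would otherwise request audio that does not exist.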

In [None]:
# Running on a segment of the audio
transcriber = AudioTranscriber('/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio_segment.mp3')
segments = transcriber.transcribe()
transcriber.save_transcription(segments, '/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio_segment.txt')

In [None]:
# Free memory
del transcriber
del segments
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

In [11]:
# View the transcription

file_path = '/content/drive/MyDrive/SPEECH ANALYTICS/AUDIOS/AUDIENCIA/Audio.txt'

# Open and read the file
with open(file_path, 'r') as file:
    content = file.read()

# Split the content into fixed-length lines (here, 120 characters)
max_length = 120
lines = content.split('\n')
wrapped_content = []

for line in lines:
    while len(line) > max_length:
        wrapped_content.append(line[:max_length])
        line = line[max_length:]
    wrapped_content.append(line)

# Print the content with wrapped lines
for line in wrapped_content:
    print(line)




SPEAKER 1 0:00:00
Esucida, luz antes de ella, la moneda afligó mi color, que a los hijos ambos se les puede dar eso. Escribe, escribe, esc
ribe, escribe, escribe, escribe, escribe, escribe, escribe. A su sombra vivamos tranquilos, y al nacer por sus nubes del
 sol, regocemos en la cura del evento. Te bendimos, te bendimos, te bendimos, te bendimos, te bendimos, adiós, te cajo, 
te cajo, te bendimos, adiós, te cajo, adiós, te cajo. Somos libres, seamos, no siempre, seamos, no siempre, y aunque nie
gue su luz, su luz, su luz se casó, que faltemos a otro solemne, que la patria de tierra llegó, que faltemos. Somos libr
es, seamos, no siempre, seamos, no siempre, que la patria de tierra llegó, que faltemos a otro solemne, que la patria de
 tierra llegó. Viva el Perú! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! V
iva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! Viva! cantando hasta un milo emoción. Que hoy tus hijos de jubilo v
encidos canta