<a href="https://colab.research.google.com/github/kutyadog/ai_notebooks/blob/main/Whisper_voices_cj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Transcribing Google Drive files with OpenAI Whisper

**Note: ONLY WORKS WITH MONO AUDIO**

CJ LINKS:

https://colab.research.google.com/drive/1HuvcY4tkTHPDzcwyVH77LCh_m8tP-Qet?usp=sharing#scrollTo=jKG14DGYbwku


STEPS:

1/ Get Google Colab resources (https://colab.research.google.com/signup) [100 points is more than enough for 40+ transcriptions (needed for the large-v2 model)]

2/ Write the file's location and name (including the file extension) in the variable 'path'.

3/ Set the number of speakers, the language, and the model.

4/ Run the code [in the 'Runtime' menu, select 'Run all'. You can find 'Runtime' at the top of the screen, next to 'Insert' and 'Tools'.]

5/ Download the .txt file from the 'Files' panel in the left menu [the one with the folder icon].


How-to videos:
  1. Getting started video [not necessary to run the code]: https://youtu.be/yVLhG4-7Sj4
  2. Converting audio to mono in Audacity: https://www.youtube.com/watch?v=TTbBibBDGpg&ab_channel=LearnAudacity

*Note: This requires giving the application permission to connect to your drive. Only you will have access to the contents of your drive, but please read the warnings carefully.*

### **For faster performance, set your runtime to "GPU"**
*Click on "Runtime" in the menu and click "Change runtime type". Select "GPU".*
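
To confirm which device the runtime actually offers before loading any models, a small check like the one below can help. This is a minimal sketch: `pick_device` is a hypothetical helper name that is not part of the notebook, and it assumes PyTorch's `torch.cuda.is_available()` is the right signal for a GPU runtime.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # preinstalled on Colab; may be missing elsewhere
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If this prints `cpu` on Colab, the runtime type has probably not been switched to GPU yet.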

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Set the full path of the audio file (location + filename, including the extension) below
path = '/content/slip.wav'


Mounted at /content/drive


In [None]:
!pip install git+https://github.com/openai/whisper.git
!pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null
# -y answers the install prompt automatically so the cell does not stall
!sudo apt update && sudo apt install -y ffmpeg

In [None]:

import whisper
import datetime

import subprocess

import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
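# Fall back to CPU when no GPU runtime is attached (slower, but it still runs)
embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))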

from pyannote.audio import Audio
from pyannote.core import Segment

import wave
import contextlib

from sklearn.cluster import AgglomerativeClustering
import numpy as np

In [None]:
# Set Number of speakers, Language & model

num_speakers = 4 #@param {type:"integer"}

language = "English" #@param ["any", "English", "Dutch"]

model_size = "large-v2" #@param ["tiny", "base", "small", "medium", "large", "large-v2"]


# English-only '.en' checkpoints exist for every size except the large ones
model_name = model_size
if language == 'English' and not model_size.startswith('large'):
  model_name += '.en'
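
The naming rule used in this cell can be sketched as a standalone function, which makes it easy to check each combination without downloading a model. `resolve_model_name` is a hypothetical helper name; the sketch assumes that only the tiny, base, small, and medium sizes ship English-only `.en` checkpoints.

```python
def resolve_model_name(model_size: str, language: str) -> str:
    """Pick the Whisper checkpoint name for a size/language pair."""
    # Assumption: only these sizes have an English-only ".en" variant
    sizes_with_en_variant = {"tiny", "base", "small", "medium"}
    if language == "English" and model_size in sizes_with_en_variant:
        return model_size + ".en"
    # Large models (and any non-English language) use the multilingual checkpoint
    return model_size


for size in ["tiny", "medium", "large", "large-v2"]:
    print(size, "->", resolve_model_name(size, "English"))
```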

In [None]:
# @title Convert m4a file to mono wav (if needed)
sourcePath = '/content/TypeScriptStandards-audio.m4a'
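# -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz, -y overwrites an existing output file
!ffmpeg -y -i "{sourcePath}" -acodec pcm_s16le -ac 1 -ar 16000 "{path}"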

In [None]:
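# The wave module used later only reads WAV files, so convert anything else first
if not path.lower().endswith('.wav'):
  # -ac 1 downmixes to mono, since this notebook only works with mono audio
  subprocess.call(['ffmpeg', '-i', path, '-ac', '1', 'audio.wav', '-y'])
  path = 'audio.wav'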

In [None]:
# Workaround commonly used on Colab when shell commands fail with a UTF-8 locale error
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
model = whisper.load_model(model_size)

100%|██████████████████████████████████████| 2.87G/2.87G [00:11<00:00, 266MiB/s]


In [None]:
result = model.transcribe(path)
segments = result["segments"]

In [None]:
with contextlib.closing(wave.open(path,'r')) as f:
  frames = f.getnframes()
  rate = f.getframerate()
  duration = frames / float(rate)
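
Because the notebook only works with mono audio, the same `wave` module can also confirm the channel count before the slower steps run. This is a minimal sketch: `wav_info` and `require_mono` are hypothetical helper names, not part of the notebook.

```python
import contextlib
import wave


def wav_info(wav_path):
    """Return (channels, sample_rate, duration_in_seconds) for a WAV file."""
    with contextlib.closing(wave.open(wav_path, "rb")) as f:
        channels = f.getnchannels()
        rate = f.getframerate()
        duration = f.getnframes() / float(rate)
    return channels, rate, duration


def require_mono(wav_path):
    """Raise ValueError unless the file has exactly one channel; return its duration."""
    channels, _rate, duration = wav_info(wav_path)
    if channels != 1:
        raise ValueError(
            f"{wav_path} has {channels} channels; convert it to mono first"
        )
    return duration
```

Calling `require_mono(path)` right after the conversion cell would fail early with a readable message instead of a shape error later in the embedding step.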

In [None]:
audio = Audio()

def segment_embedding(segment):
  start = segment["start"]
  # Whisper overshoots the end timestamp in the last segment
  end = min(duration, segment["end"])
  clip = Segment(start, end)
  waveform, sample_rate = audio.crop(path, clip)
  # Add a leading batch dimension: the embedding model expects (batch, channel, samples)
  return embedding_model(waveform[None])

In [None]:
# One row per segment; 192 is the embedding size of the ECAPA model loaded above
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
  embeddings[i] = segment_embedding(segment)

embeddings = np.nan_to_num(embeddings)

In [None]:
clustering = AgglomerativeClustering(n_clusters=num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
  segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)
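
To see what the clustering step does without running a full transcription, it can be tried on synthetic data. The sketch below is illustrative only: `fake_embeddings` is made-up random data standing in for real speaker embeddings, and the two well-separated groups are an assumption chosen so the result is easy to inspect.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real embeddings: two tight groups in 192 dimensions
speaker_a = rng.normal(loc=0.0, scale=0.05, size=(5, 192))
speaker_b = rng.normal(loc=1.0, scale=0.05, size=(5, 192))
fake_embeddings = np.vstack([speaker_a, speaker_b])

# Same call as in the notebook, with n_clusters playing the role of num_speakers
demo_labels = AgglomerativeClustering(n_clusters=2).fit(fake_embeddings).labels_

# Labels are 0-based cluster ids; the notebook shifts them to 1-based speaker names
demo_speakers = ['SPEAKER ' + str(label + 1) for label in demo_labels]
print(demo_speakers)
```

The cluster ids carry no meaning beyond grouping, so which voice ends up as "SPEAKER 1" is arbitrary. Real recordings are rarely this cleanly separated, so short or noisy segments may be assigned to the wrong speaker.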

In [None]:
def time(secs):
  return datetime.timedelta(seconds=round(secs))

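# The with-block closes the file even if writing fails partway through
with open("transcript.txt", "w", encoding="utf-8") as f:
  for (i, segment) in enumerate(segments):
    # Write a speaker/timestamp header whenever the speaker changes
    if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
      f.write("\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n')
    # Segment text starts with a space; drop it before writing
    f.write(segment["text"][1:] + ' ')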
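
The layout of `transcript.txt` can be previewed without running the models by applying the same grouping rule to hand-written segments. This is a minimal sketch: `format_transcript` is a hypothetical helper name and `sample_segments` is made-up data with the same keys the notebook uses (`start`, `text`, `speaker`).

```python
import datetime


def format_transcript(segments):
    """Build the transcript text: one header per speaker turn, then the text."""
    parts = []
    previous_speaker = None
    for segment in segments:
        if segment["speaker"] != previous_speaker:
            stamp = datetime.timedelta(seconds=round(segment["start"]))
            parts.append("\n" + segment["speaker"] + " " + str(stamp) + "\n")
            previous_speaker = segment["speaker"]
        # strip() removes the leading space that segment text carries
        parts.append(segment["text"].strip() + " ")
    return "".join(parts)


# Made-up segments for illustration only
sample_segments = [
    {"start": 0.0, "text": " Hello there.", "speaker": "SPEAKER 1"},
    {"start": 2.4, "text": " How are you?", "speaker": "SPEAKER 1"},
    {"start": 4.6, "text": " Fine, thanks.", "speaker": "SPEAKER 2"},
]
print(format_transcript(sample_segments))
```

Consecutive segments from the same speaker are merged under a single header, which is why the sample prints two headers for three segments.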