# Chat Podcast

Author: Kenneth Leung

## 02. Whisper Transcription
- Use Whisper audio-to-text capabilities to transcribe MP3 audio files of podcasts

___
**Note:** It is highly recommended to open and run this notebook in Colab with a GPU runtime. ![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)

___

## (1) Mount Drive in Colab
- Mounting Drive makes the audio files accessible faster than uploading them to Colab directly

In [None]:
# Mount Google Drive (the MP3 files are saved in Drive)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive/Data Vault/GitHub/Chat-Podcast

/content/drive/MyDrive/Data Vault/GitHub/Chat-Podcast


___
## (2) Install and Import Dependencies

In [None]:
# !pip install langchain
# !pip install openai
# !pip install -U openai-whisper
# !pip install python-dotenv
# !pip install pinecone-client


In [49]:
import json
import os
import pandas as pd
import time
import torch
import whisper
import yaml
from dotenv import load_dotenv
from pathlib import Path
from langchain.embeddings.openai import OpenAIEmbeddings

___
## (3) Configuration Settings

In [None]:
torch.__version__

'1.13.1+cu116'

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [None]:
# Config settings
DEMO_PATH = 'demo'
AUDIO_PATH = 'audio'
TRANSCRIPT_PATH = 'transcripts'

In [None]:
os.environ['OPENAI_API_KEY'] = 'your_key_here'
# load_dotenv(dotenv_path='.env', verbose=True)
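
A key typed directly into a cell is easy to leak when the notebook is shared. As a minimal sketch, assuming the key has already been exported or loaded from a `.env` file, a small check can fail early if it is missing or still set to the placeholder. The helper name `get_api_key` is hypothetical and not part of the original notebook.

```python
import os


def get_api_key(name="OPENAI_API_KEY"):
    """Return the key stored in the environment.

    Raises RuntimeError if the variable is unset, empty, or still the
    'your_key_here' placeholder used in this notebook.
    """
    key = os.environ.get(name, "")
    if not key or key == "your_key_here":
        raise RuntimeError(
            f"{name} is not set; export it or load it from a .env file first"
        )
    return key
```

Calling `get_api_key()` right after the configuration cell surfaces a missing key before any long-running step starts.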

___
## (4) Initial Demo Run

In [None]:
whisper_model = whisper.load_model("medium.en").to(device)

In [None]:
text = whisper_model.transcribe(f"{DEMO_PATH}/Liam Neeson - Taken.mp3")
text['text']

" I don't know who you are. I don't know what you want. If you are looking for ransom, I can tell you I don't have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don't, I will look for you. I will find you. And I will kill you. Good luck."

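Besides the full `text`, the dictionary returned by `transcribe` also carries a `segments` list whose entries have `start`, `end`, and `text` fields (these are the fields used in section 5). The sketch below shows one way to print segments with timestamps. The helper names `format_timestamp` and `format_segments` are hypothetical, and `sample_result` is a hand-written stand-in with made-up timestamps, not real Whisper output.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to an HH:MM:SS string (fractions are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(result):
    """Return one '[start - end] text' line per segment in a Whisper-style result."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in result["segments"]
    ]


# Illustrative stand-in for a transcribe() result; the timestamps are made up
sample_result = {
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " I don't know who you are."},
        {"start": 2.5, "end": 4.8, "text": " I don't know what you want."},
    ]
}

for line in format_segments(sample_result):
    print(line)
```

In the notebook, the same helper could be applied to the `text` variable from the demo cell, since that variable holds the full result dictionary.
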
___
## (5) Transcribe All Audio Files

In [None]:
# Load podcast metadata (generated from notebook 01)
metadata = pd.read_csv('podcast_metadata.csv')
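
The transcription loop in this section looks up each episode's `Date` and `URL` by filtering on `Title` and taking `.values[0]`, which raises an `IndexError` when a title is absent from the CSV. As a hedged sketch, a plain dictionary lookup can make the missing case explicit. The helper names `build_metadata_lookup` and `get_episode_info` are hypothetical, and the sample rows are placeholders rather than real podcast metadata.

```python
def build_metadata_lookup(rows):
    """Map each episode title to its date and URL.

    `rows` is any iterable of dicts with 'Title', 'Date' and 'URL' keys.
    """
    return {
        row["Title"]: {"date": row["Date"], "url": row["URL"]}
        for row in rows
    }


def get_episode_info(lookup, title):
    """Return (date, url) for a title, or (None, None) if the title is absent."""
    info = lookup.get(title)
    if info is None:
        return None, None
    return info["date"], info["url"]


# Placeholder rows for illustration only
sample_rows = [
    {"Title": "Episode A", "Date": "2023-01-01", "URL": "https://example.com/a"},
    {"Title": "Episode B", "Date": "2023-01-08", "URL": "https://example.com/b"},
]
lookup = build_metadata_lookup(sample_rows)
print(get_episode_info(lookup, "Episode A"))
print(get_episode_info(lookup, "Not in the CSV"))
```

With the DataFrame loaded above, the rows could be supplied as `metadata.to_dict('records')`.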

In [None]:
paths = sorted([str(x) for x in Path(AUDIO_PATH).glob('*.mp3')])
paths

["audio/A Third Path to Talent Development - Delta's Michelle McCrackin.mp3",
 "audio/AI in Aerospace - Boeing's Helen Lee.mp3",
 "audio/AI in Your Living Room - Peloton's Sanjay Nichani.mp3",
 "audio/Big Data in Agriculture - Land O'Lakes' Teddy Bekele.mp3",
 "audio/Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott.mp3",
 "audio/Digital First, Physical Second - Wayfair's Fiona Tan.mp3",
 "audio/Extreme Innovation with AI - Stanley Black and Decker's Mark Maybury.mp3",
 "audio/From Data to Wisdom - Novo Nordisk's Tonia Sideri.mp3",
 "audio/From Journalism to Jeans - Levi Strauss' Katia Walsh.mp3",
 "audio/Helping Doctors Make Better Decisions with Data - UC Berkley's Ziad Obermeyer.mp3",
 "audio/Imagining Furniture (and the Future) with AI - IKEA Retail's Barbara Martin Coppola.mp3",
 "audio/Inventing the Beauty of the Future - L'Oreal's Stephane Lannuzel.mp3",
 "audio/Investing in the Last Mile - PayPal's Khatereh Khodavirdi.mp3",
 "audio/Keeping Humans in

In [None]:
# Save each transcript as a JSON Lines file (one JSON object per line)
def save_transcript_json(content, title):
    with open(f"{TRANSCRIPT_PATH}/{title}.jsonl", "w", encoding="utf-8") as fp:
        for line in content:
            json.dump(line, fp)
            fp.write('\n')
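
Because each line of the saved file is an independent JSON object, a transcript can be read back one line at a time. The following is a minimal sketch of the reverse operation. The helper name `load_transcript_jsonl` is hypothetical, and the sample records and temporary file exist only to demonstrate the round trip.

```python
import json
import tempfile
from pathlib import Path


def load_transcript_jsonl(path):
    """Read a JSON Lines file and return its records as a list of dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:  # Ignore blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


# Round-trip check with made-up records written in the same one-object-per-line layout
sample = [
    {"id": "demo-t0.0", "text": "First segment", "start": 0.0, "end": 2.5},
    {"id": "demo-t2.5", "text": "Second segment", "start": 2.5, "end": 4.8},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "demo.jsonl"
    with open(file_path, "w", encoding="utf-8") as fp:
        for record in sample:
            json.dump(record, fp)
            fp.write("\n")
    loaded = load_transcript_jsonl(file_path)

print(loaded == sample)
```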

In [None]:
# Transcribe every MP3 file in audio folder
for i, path in enumerate(paths):
    episode_content = []

    # Get title of podcast episode (file name without the .mp3 extension)
    title = Path(path).stem

    # Skip if transcript already exists
    existing_transcripts = [x.stem for x in Path(TRANSCRIPT_PATH).glob('*')]
    if title in existing_transcripts:
        print(f'Transcript already exists for {title}. Skipping')
    else:
        date = metadata[metadata.Title == title]["Date"].values[0]
        url = metadata[metadata.Title == title]["URL"].values[0]
      
        # Initiate timer
        print(f'Begin transcription for {title}')
        start = time.time()

        # Transcribe MP3 audio with Whisper
        result = whisper_model.transcribe(path)
        segments = result['segments']

        for segment in segments:
            # Merge segments data and podcast metadata
            segment_content = {'title': title,
                               'date': date,
                               'url': url,
                               'id': f"{title}-t{segment['start']}",
                               'text': segment['text'].strip(),
                               'start': segment['start'],
                               'end': segment['end']}
            episode_content.append(segment_content)

        # Save contents as a JSON Lines file
        save_transcript_json(episode_content, title)

        # Show time taken
        duration = time.time() - start
        print(f"{duration/60} minutes taken for episode: {title}")

Transcript already exists for A Third Path to Talent Development - Delta's Michelle McCrackin. Skipping
Transcript already exists for AI in Aerospace - Boeing's Helen Lee. Skipping
Begin transcription for AI in Your Living Room - Peloton's Sanjay Nichani
2.9641833583513897 minutes taken for episode: AI in Your Living Room - Peloton's Sanjay Nichani
Begin transcription for Big Data in Agriculture - Land O'Lakes' Teddy Bekele
3.1866719404856365 minutes taken for episode: Big Data in Agriculture - Land O'Lakes' Teddy Bekele
Begin transcription for Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott
4.17211240530014 minutes taken for episode: Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott
Begin transcription for Digital First, Physical Second - Wayfair's Fiona Tan
3.555857837200165 minutes taken for episode: Digital First, Physical Second - Wayfair's Fiona Tan
Begin transcription for Extreme Innovation with AI - Stanley Black and D