# Chat Podcast

Author: Kenneth Leung

## 02. Whisper Transcription and Pinecone Build
- Use Whisper audio-to-text capabilities to transcribe MP3 audio files of podcasts
- Use Pinecone to build vectorstores of transcripts

___
**Note:** We highly recommend opening and running this notebook in Colab with a GPU runtime. ![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)

___

## (1) Mount Drive in Colab
- Mounting Drive makes the audio files accessible faster than uploading them to Colab

In [None]:
# Mount Google drive (since MP3 files are saved in Drive)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd drive/MyDrive/Data Vault/GitHub/Chat-Podcast

/content/drive/MyDrive/Data Vault/GitHub/Chat-Podcast


___
## (2) Install and Import Dependencies

In [None]:
# !pip install langchain
# !pip install openai
# !pip install -U openai-whisper
# !pip install python-dotenv
# !pip install pinecone-client


In [49]:
import json
import os
import pandas as pd
import pinecone
import time
import torch
import whisper
import yaml
from dotenv import load_dotenv
from pathlib import Path
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

___
## (3) Configuration Settings

In [None]:
torch.__version__

'1.13.1+cu116'

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [None]:
# Config settings
DEMO_PATH = 'demo'
AUDIO_PATH = 'audio'
TRANSCRIPT_PATH = 'transcripts'
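
The folders above are assumed to already exist in the working directory. A minimal sketch, assuming you would rather create any missing folder up front (the helper name `ensure_dirs` is hypothetical and not part of the original notebook):

```python
from pathlib import Path

def ensure_dirs(*dirs, base="."):
    """Create each folder under `base` if it does not exist yet."""
    created = []
    for d in dirs:
        p = Path(base) / d
        p.mkdir(parents=True, exist_ok=True)  # No error if it already exists
        created.append(p)
    return created
```

In this notebook it could be called as `ensure_dirs(DEMO_PATH, AUDIO_PATH, TRANSCRIPT_PATH)` before any transcript is written.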

___
## (4) Initial Demo Run

In [None]:
whisper_model = whisper.load_model("medium.en").to(device)

In [None]:
text = whisper_model.transcribe(f"{DEMO_PATH}/Liam Neeson - Taken.mp3")
text['text']

" I don't know who you are. I don't know what you want. If you are looking for ransom, I can tell you I don't have money. But what I do have are a very particular set of skills. Skills I have acquired over a very long career. Skills that make me a nightmare for people like you. If you let my daughter go now, that will be the end of it. I will not look for you. I will not pursue you. But if you don't, I will look for you. I will find you. And I will kill you. Good luck."
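
`transcribe` returns a dictionary. Besides the full `text`, it carries a `segments` list whose entries include `start`, `end`, and `text`, which section (5) relies on. The sketch below uses a hand-written stand-in dictionary (the timestamps are made up for illustration and are not real model output) and a hypothetical helper `format_segments` to show how those fields can be read:

```python
def format_segments(result):
    """Return one 'start-end: text' line per segment of a Whisper-style result."""
    lines = []
    for seg in result["segments"]:
        lines.append(f"{seg['start']:.2f}s-{seg['end']:.2f}s: {seg['text'].strip()}")
    return lines

# Stand-in for the dictionary returned by whisper_model.transcribe(...).
# The timestamps are illustrative only.
mock_result = {
    "text": " I don't know who you are. I don't know what you want.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " I don't know who you are."},
        {"start": 2.5, "end": 5.0, "text": " I don't know what you want."},
    ],
}

for line in format_segments(mock_result):
    print(line)
```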

___
## (5) Transcribe All Audio Files

In [None]:
# Load podcast metadata (generated from notebook 01)
metadata = pd.read_csv('podcast_metadata.csv')

In [None]:
paths = sorted([str(x) for x in Path(AUDIO_PATH).glob('*.mp3')])
paths

["audio/A Third Path to Talent Development - Delta's Michelle McCrackin.mp3",
 "audio/AI in Aerospace - Boeing's Helen Lee.mp3",
 "audio/AI in Your Living Room - Peloton's Sanjay Nichani.mp3",
 "audio/Big Data in Agriculture - Land O'Lakes' Teddy Bekele.mp3",
 "audio/Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott.mp3",
 "audio/Digital First, Physical Second - Wayfair's Fiona Tan.mp3",
 "audio/Extreme Innovation with AI - Stanley Black and Decker's Mark Maybury.mp3",
 "audio/From Data to Wisdom - Novo Nordisk's Tonia Sideri.mp3",
 "audio/From Journalism to Jeans - Levi Strauss' Katia Walsh.mp3",
 "audio/Helping Doctors Make Better Decisions with Data - UC Berkley's Ziad Obermeyer.mp3",
 "audio/Imagining Furniture (and the Future) with AI - IKEA Retail's Barbara Martin Coppola.mp3",
 "audio/Inventing the Beauty of the Future - L'Oreal's Stephane Lannuzel.mp3",
 "audio/Investing in the Last Mile - PayPal's Khatereh Khodavirdi.mp3",
 "audio/Keeping Humans in

In [None]:
# Save each transcript as a JSON Lines file
def save_transcript_json(content, title):
    with open(f"{TRANSCRIPT_PATH}/{title}.jsonl", "w", encoding="utf-8") as fp:
        for line in content:
            json.dump(line, fp)
            fp.write('\n')
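
JSON Lines stores one JSON object per line, so a transcript can be read back one segment at a time. A small round-trip sketch, assuming a temporary folder, toy records, and two hypothetical helpers (`write_jsonl` mirrors the writer above, `load_transcript_json` is its reader):

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            json.dump(record, fp)
            fp.write("\n")

def load_transcript_json(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]

# Toy records shaped like the segment dictionaries built in this notebook
segments = [
    {"id": "demo-t0.0", "text": "Hello.", "start": 0.0, "end": 1.2},
    {"id": "demo-t1.2", "text": "World.", "start": 1.2, "end": 2.0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.jsonl"
    write_jsonl(segments, path)
    restored = load_transcript_json(path)

print(restored == segments)
```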

In [None]:
# Transcribe every MP3 file in audio folder
for i, path in enumerate(paths):
    episode_content = []

    # Get info of podcast episode
    title = path.split('/')[-1][:-4]

    # Skip if transcript already exists
    existing_transcripts = [x.stem for x in Path(TRANSCRIPT_PATH).glob('*.jsonl')]
    if title in existing_transcripts:
        print(f'Transcript already exists for {title}. Skipping')
    else:
        date = metadata[metadata.Title == title]["Date"].values[0]
        url = metadata[metadata.Title == title]["URL"].values[0]
      
        # Initiate timer
        print(f'Begin transcription for {title}')
        start = time.time()

        # Transcribe MP3 audio with Whisper
        result = whisper_model.transcribe(path)
        segments = result['segments']

        for segment in segments:
            # Merge segments data and podcast metadata
            segment_content = {'title': title,
                               'date': date,
                               'url': url,
                               'id': f"{title}-t{segment['start']}",
                               'text': segment['text'].strip(),
                               'start': segment['start'],
                               'end': segment['end']}
            episode_content.append(segment_content)

        # Save contents as JSON
        save_transcript_json(episode_content, title)
      
        # Show time taken
        duration = time.time() - start
        print(f"{duration/60} minutes taken for episode: {title}")

Transcript already exists for A Third Path to Talent Development - Delta's Michelle McCrackin. Skipping
Transcript already exists for AI in Aerospace - Boeing's Helen Lee. Skipping
Begin transcription for AI in Your Living Room - Peloton's Sanjay Nichani
2.9641833583513897 minutes taken for episode: AI in Your Living Room - Peloton's Sanjay Nichani
Begin transcription for Big Data in Agriculture - Land O'Lakes' Teddy Bekele
3.1866719404856365 minutes taken for episode: Big Data in Agriculture - Land O'Lakes' Teddy Bekele
Begin transcription for Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott
4.17211240530014 minutes taken for episode: Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott
Begin transcription for Digital First, Physical Second - Wayfair's Fiona Tan
3.555857837200165 minutes taken for episode: Digital First, Physical Second - Wayfair's Fiona Tan
Begin transcription for Extreme Innovation with AI - Stanley Black and D
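
The loop skips episodes whose transcript is already on disk, which makes the cell safe to re-run after an interrupted Colab session. A minimal sketch of that check in isolation, assuming a hypothetical helper `pending_audio` and using `Path.stem` so that titles containing periods are compared in full:

```python
from pathlib import Path

def pending_audio(audio_dir, transcript_dir):
    """Return MP3 paths in audio_dir that have no matching .jsonl transcript."""
    # Titles of episodes that already have a transcript file
    done = {p.stem for p in Path(transcript_dir).glob("*.jsonl")}
    return sorted(p for p in Path(audio_dir).glob("*.mp3") if p.stem not in done)
```

Calling `pending_audio(AUDIO_PATH, TRANSCRIPT_PATH)` would then list only the episodes that still need transcribing.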

___
## (6) Post-Processing of Transcripts

In [None]:
# View all transcribed files
transcripts = sorted([str(x) for x in Path(TRANSCRIPT_PATH).glob('*.jsonl')])
transcripts

["transcripts/A Third Path to Talent Development - Delta's Michelle McCrackin.jsonl",
 "transcripts/AI in Aerospace - Boeing's Helen Lee.jsonl",
 "transcripts/AI in Your Living Room - Peloton's Sanjay Nichani.jsonl",
 "transcripts/Big Data in Agriculture - Land O'Lakes' Teddy Bekele.jsonl",
 "transcripts/Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott.jsonl",
 "transcripts/Digital First, Physical Second - Wayfair's Fiona Tan.jsonl",
 "transcripts/Extreme Innovation with AI - Stanley Black and Decker's Mark Maybury.jsonl",
 "transcripts/From Data to Wisdom - Novo Nordisk's Tonia Sideri.jsonl",
 "transcripts/From Journalism to Jeans - Levi Strauss' Katia Walsh.jsonl",
 "transcripts/Helping Doctors Make Better Decisions with Data - UC Berkley's Ziad Obermeyer.jsonl",
 "transcripts/Imagining Furniture (and the Future) with AI - IKEA Retail's Barbara Martin Coppola.jsonl",
 "transcripts/Inventing the Beauty of the Future - L'Oreal's Stephane Lannuzel.jsonl",
 

In [None]:
lines = []

# Combine all JSONL files together
for transcript in transcripts:
    with open(transcript, "r", encoding="utf-8") as fp:
        for line in fp:
            line = json.loads(line) # Parse the JSON string into a dict
            lines.append(line)

In [None]:
print(len(lines))

7152


In [None]:
lines[6]

{'title': "A Third Path to Talent Development - Delta's Michelle McCrackin",
 'date': 'Mar-23',
 'url': 'https://open.spotify.com/episode/50oRprIC6z0wJkpfLFQHDi',
 'id': "A Third Path to Talent Development - Delta's Michelle McCrackin-t32.56",
 'text': "I'm also the AI and Business Strategy guest editor at MIT Sloan Management Review.",
 'start': 32.56,
 'end': 38.019999999999996}

In [None]:
# Check the text of a few sample segments
for chunk in lines[5:8]:
    print(chunk['text'])

I'm Sam Ransbotham, Professor of Analytics at Boston College.
I'm also the AI and Business Strategy guest editor at MIT Sloan Management Review.
And I'm Shervin Kottubande, senior partner with BCG and one of the leaders of our AI business.


___
## (7) Extend Segment Texts
- A single Whisper segment is often only one phrase or sentence long, which is too short to index on its own
- To make the indexed text more useful and coherent, we combine the texts of several consecutive segments

In [None]:
# Chunking and striding
new_segments = []

chunk_size = 6  # No. of segment texts to combine
chunk_overlap = 3  # No. of segment texts to overlap

for i in range(0, len(lines), chunk_overlap):
    i_end = min(len(lines)-1, i + chunk_size)
    if lines[i]['title'] != lines[i_end]['title']:
        # Skip chunks that would span two different episodes
        continue
    text_list = []
    for chunk in lines[i:i_end]:
        text_list.append(chunk['text'])
    text = ' '.join(text_list)
    new_segments.append({
        'start': lines[i]['start'],
        'end': lines[i_end]['end'],  # Note: end of the segment just after the last one joined into text
        'title': lines[i]['title'],
        'text': text,
        'id': lines[i]['id'],
        'url': lines[i]['url'],
        'date': lines[i]['date']
    })

In [None]:
len(new_segments)

2342

In [None]:
new_segments[0]

{'start': 0.0,
 'end': 38.019999999999996,
 'title': "A Third Path to Talent Development - Delta's Michelle McCrackin",
 'text': "How can organizations take advantage of existing deep domain knowledge? Find out how one airline is upscaling its frontline workforce on today's episode. I'm Michelle McCracken from Delta Airlines and you're listening to Me, Myself and AI. Welcome to Me, Myself and AI, a podcast on artificial intelligence and business. Each episode we introduce you to someone innovating with AI. I'm Sam Ransbotham, Professor of Analytics at Boston College.",
 'id': "A Third Path to Talent Development - Delta's Michelle McCrackin-t0.0",
 'url': 'https://open.spotify.com/episode/50oRprIC6z0wJkpfLFQHDi',
 'date': 'Mar-23'}
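
The chunking step is a sliding window: each new segment joins `chunk_size` consecutive texts, and the window advances by `chunk_size - chunk_overlap` segments. Below is a self-contained sketch on toy data, assuming a hypothetical helper `merge_segments`. It is a variant rather than a copy of the cell above: it groups segments by title first, keeps the shorter final window of each episode, and takes `end` from the last segment actually included in the text.

```python
from itertools import groupby

def merge_segments(lines, chunk_size=6, chunk_overlap=3):
    """Merge consecutive segments of the same title into overlapping chunks."""
    stride = chunk_size - chunk_overlap
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    merged = []
    # groupby keeps chunks from spanning two episodes (input must be ordered by title)
    for title, group in groupby(lines, key=lambda seg: seg["title"]):
        segs = list(group)
        for i in range(0, len(segs), stride):
            window = segs[i:i + chunk_size]
            merged.append({
                "title": title,
                "id": window[0]["id"],
                "start": window[0]["start"],
                "end": window[-1]["end"],
                "text": " ".join(seg["text"] for seg in window),
            })
            if i + chunk_size >= len(segs):
                break  # This window already reaches the end of the episode
    return merged

# Toy data: eight one-second segments for "ep1" and a single segment for "ep2"
toy = [
    {"title": "ep1", "id": f"ep1-t{n}", "start": float(n), "end": float(n + 1), "text": f"s{n}"}
    for n in range(8)
]
toy.append({"title": "ep2", "id": "ep2-t0", "start": 0.0, "end": 1.0, "text": "x0"})

chunks = merge_segments(toy)
for chunk in chunks:
    print(chunk["title"], chunk["start"], chunk["end"], chunk["text"])
```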

___
## (8) Setup Vectorstore with Pinecone

In [None]:
# Load API keys from a local .env file, if one exists
load_dotenv()
embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])

In [None]:
# Initialize pinecone instance
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENV'])

index_name = "chat-podcast"

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        1536, # Dimensions of OpenAI embeddings
        metric="cosine"
    )

index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [None]:
# Convert segments into three lists for vectorstore upsert
texts = [elem['text'] for elem in new_segments]
ids = [elem['id'] for elem in new_segments]
metadatas = [{
            "text": elem["text"],
            "start": elem["start"],
            "end": elem["end"],
            "url": elem["url"],
            "date": elem["date"],
            "title": elem["title"]
            } for elem in new_segments]
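
Pinecone upserts overwrite any existing vector that has the same ID, so two segments sharing an ID would silently collapse into one. A quick sanity check before upserting, assuming a hypothetical helper `find_duplicate_ids`:

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i in counts if counts[i] > 1]
```

Running `find_duplicate_ids(ids)` on the list built above should return an empty list if every title and start-time pair is unique.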

In [None]:
docsearch = Pinecone.from_texts(texts=texts, 
                                embedding=embeddings, 
                                metadatas=metadatas,
                                ids=ids,
                                index_name=index_name)

In [None]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2342}},
 'total_vector_count': 2342}

___
## (9) Vector Similarity Search

In [None]:
query = "Which guest was invited to talk about the airline industry?"
docs = docsearch.similarity_search(query)

In [None]:
print(docs[0].page_content)

Shervin are excited to be talking today with Helen Li, Regional Director of Air Traffic Management and Airport Programs in China for the Boeing Company. Helen, thanks for taking the time to talk with us. Welcome. Thank you for having me. Let's get started. Helen, can you tell us about your current role at Boeing? I currently work at Boeing China in the Beijing office.
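
Conceptually, a similarity search embeds the query and ranks the stored vectors by the index metric, which is cosine similarity here. The toy sketch below illustrates the ranking step only. It uses made-up 3-dimensional vectors instead of 1536-dimensional OpenAI embeddings and a brute-force scan, which is an assumption for clarity and not a description of how Pinecone works internally. The names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k=2):
    """Return the k (id, score) pairs most similar to query_vec."""
    scored = [(vid, cosine_similarity(query_vec, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors standing in for embedded transcript chunks
toy_index = {
    "airline-chunk": [0.9, 0.1, 0.0],
    "retail-chunk": [0.1, 0.9, 0.0],
    "health-chunk": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(results)
```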


In [None]:
# References
# https://huggingface.co/openai/whisper-medium
# https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/pinecone.py
# https://www.pinecone.io/learn/openai-whisper/