# Chat Podcast

Author: Kenneth Leung

## 03B. FAISS Vectorstore
- Use FAISS to build a vectorstore from the podcast transcripts
- https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/faiss.py

___
## (1) Install and Import Dependencies

In [1]:
import json
import os
import pandas as pd
import yaml
from dotenv import load_dotenv
from pathlib import Path
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

___
## (2) Configuration Settings

In [2]:
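load_dotenv()  # Pick up OPENAI_API_KEY from a local .env file, if one exists
os.environ.setdefault('OPENAI_API_KEY', 'your_key_here')  # Fallback placeholder; replace with your key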

In [3]:
# Config settings
AUDIO_PATH = '../audio'
TRANSCRIPT_PATH = '../transcripts'

___
## (3) Processing of Transcripts

In [68]:
# View all transcribed files
transcripts = sorted([str(x) for x in Path(TRANSCRIPT_PATH).glob('*.jsonl')])
transcripts

["..\\transcripts\\A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen.jsonl",
 "..\\transcripts\\AI in Aerospace - Boeing's Helen Lee.jsonl",
 "..\\transcripts\\AI in Your Living Room - Peloton's Sanjay Nichani.jsonl",
 "..\\transcripts\\AI in the Supply Chain - Cold Chain Technologies' Ranjeet Banerjee.jsonl",
 "..\\transcripts\\Big Data in Agriculture - Land O'Lakes' Teddy Bekele.jsonl",
 "..\\transcripts\\Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott.jsonl",
 "..\\transcripts\\Digital First, Physical Second - Wayfair's Fiona Tan.jsonl",
 "..\\transcripts\\Extreme Innovation with AI - Stanley Black and Decker's Mark Maybury.jsonl",
 "..\\transcripts\\From Data to Wisdom - Novo Nordisk's Tonia Sideri.jsonl",
 "..\\transcripts\\From Journalism to Jeans - Levi Strauss' Katia Walsh.jsonl",
 "..\\transcripts\\Helping Doctors Make Better Decisions with Data - UC Berkley's Ziad Obermeyer.jsonl",
 "..\\transcripts\\Imagining Furniture (and th

In [69]:
lines = []

# Combine all JSONL files together
for transcript in transcripts:
    with open(transcript, "r", encoding="utf-8") as fp:
        for line in fp:
            line = json.loads(line)  # Parse the JSON string into a dict
            lines.append(line)
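
The loop above assumes every line in every file is a valid JSON object. A minimal sketch of the same loading step as reusable helpers, which also tolerates blank lines, is shown here; the names `load_jsonl` and `load_all` are hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    records = []
    with open(path, "r", encoding="utf-8") as fp:
        for raw in fp:
            raw = raw.strip()
            if not raw:  # tolerate blank lines, e.g. a trailing newline
                continue
            records.append(json.loads(raw))
    return records


def load_all(folder):
    """Concatenate the records of every *.jsonl file in a folder, in sorted filename order."""
    records = []
    for path in sorted(Path(folder).glob("*.jsonl")):
        records.extend(load_jsonl(path))
    return records
```

With these helpers, `lines = load_all(TRANSCRIPT_PATH)` would play the role of the cell above.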

In [70]:
print(len(lines))

7682


In [71]:
lines[6]

{'title': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen",
 'date': '2023-03-28 07:00:00+00:00',
 'url': 'https://open.spotify.com/episode/1uTJp2EeePc29X4N1OsGoo',
 'id': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen-t23.240000000000002",
 'text': 'Welcome to Me, Myself and AI, a podcast on artificial intelligence and business.',
 'start': 23.240000000000002,
 'end': 28.34}

In [72]:
# Preview the text of a few consecutive segments
for chunk in lines[5:8]:
    print(chunk['text'])

and AI.
Welcome to Me, Myself and AI, a podcast on artificial intelligence and business.
Each episode, we introduce you to someone innovating with AI.


___
## (4) Extend Segment Texts
- A single segment is often only one phrase or sentence long, which gives too little context for retrieval
- To make the index more useful, we combine the texts of several consecutive segments, with overlap between neighboring chunks

In [73]:
# Chunking and striding
new_segments = []

chunk_size = 6  # No. of segment texts to combine
chunk_overlap = 3  # No. of segment texts to overlap

for i in range(0, len(lines), chunk_overlap):
    i_end = min(len(lines)-1, i + chunk_size)
    if lines[i]['title'] != lines[i_end]['title']:
        # Skip windows that span two different episodes (titles differ)
        continue
    text_list = []
    for chunk in lines[i:i_end]:
        text_list.append(chunk['text'])
    text = ' '.join(text_list)
    new_segments.append({
        'start': lines[i]['start'],
        'end': lines[i_end]['end'],
        'title': lines[i]['title'],
        'text': text,
        'id': lines[i]['id'],
        'url': lines[i]['url'],
        'date': lines[i]['date']
    })
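
In the loop above, `end` is read from `lines[i_end]` while the joined text stops at `lines[i_end - 1]`, any window that crosses an episode boundary is skipped, and the step equals `chunk_overlap` (which matches `chunk_size - chunk_overlap` only because 6 - 3 = 3). A possible variant that windows each episode separately and keeps `end` aligned with the last joined segment is sketched below. The function name `chunk_segments` is hypothetical, and this variant would produce different counts and timestamps from the outputs recorded in this notebook.

```python
from itertools import groupby


def chunk_segments(lines, chunk_size=6, chunk_overlap=3):
    """Join consecutive segments into overlapping windows, one episode at a time."""
    stride = chunk_size - chunk_overlap  # how far each window advances
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks = []
    # groupby keeps runs of consecutive segments that share a title together,
    # so no window can mix two episodes
    for title, group in groupby(lines, key=lambda seg: seg['title']):
        episode = list(group)
        for i in range(0, len(episode), stride):
            window = episode[i:i + chunk_size]
            chunks.append({
                'start': window[0]['start'],
                'end': window[-1]['end'],  # end of the last segment actually joined
                'title': title,
                'text': ' '.join(seg['text'] for seg in window),
                'id': window[0]['id'],
                'url': window[0]['url'],
                'date': window[0]['date'],
            })
    return chunks
```

Grouping by title first means the tail of each episode still gets a (shorter) chunk instead of being dropped.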

In [74]:
len(new_segments)

2517

In [75]:
new_segments[1636]

{'start': 171.04,
 'end': 191.76,
 'title': "Out of the Lab and Into a Product - Microsoft's Eric Boyd",
 'text': "components and delivering them. How do you know what to build? How do you tell people how to use them? How does this work? How does this infrastructure and ecosystem start to play out? Yeah. I mean, we're pretty privileged at Microsoft to have a whole bunch of different businesses that we've been in for a while.",
 'id': "Out of the Lab and Into a Product - Microsoft's Eric Boyd-t171.04",
 'url': 'https://open.spotify.com/episode/5XDcsbVuaQGkhs5m5Y4RfE',
 'date': 'Feb-23'}

___
## (5) Setup Vectorstore with FAISS

In [76]:
embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])

In [77]:
new_segments[0]

{'start': 0.0,
 'end': 28.34,
 'title': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen",
 'text': "Our guests often use Lego as an analogy for how organizations can build up solutions with data. But today, find out how Lego itself builds data components that connect as easily as it breaks. I'm Anders Putschbard-Kressensson from the Lego Group and you're listening to Me, Myself and AI.",
 'id': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen-t0.0",
 'url': 'https://open.spotify.com/episode/1uTJp2EeePc29X4N1OsGoo',
 'date': '2023-03-28 07:00:00+00:00'}

In [78]:
# Convert segments into lists for the vectorstore (ids are collected but not passed to FAISS below)
texts = [elem['text'] for elem in new_segments]
ids = [elem['id'] for elem in new_segments]
metadatas = [{
#             "text": elem["text"],
            "start": elem["start"],
            "end": elem["end"],
            "url": elem["url"],
            "date": elem["date"],
            "title": elem["title"]
            } for elem in new_segments]

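##### Note that OpenAI embedding models limit each input text to roughly 8,000 tokens (8,191 for text-embedding-ada-002), so every combined text must stay below that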
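
A rough pre-flight check against the limit in the note above could look like the sketch below. It is only a heuristic: the ratio of about 4 characters per token is an assumption that holds loosely for English text, the helper names are hypothetical, and an exact count would need a real tokenizer.

```python
MAX_TOKENS = 8000      # conservative budget, in line with the note above
CHARS_PER_TOKEN = 4    # assumption: rough average for English text


def estimate_tokens(text):
    """Rough token estimate from the character count (not a real tokenizer)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def find_oversized(texts, max_tokens=MAX_TOKENS):
    """Return the indices of texts whose estimated token count exceeds the budget."""
    return [i for i, text in enumerate(texts) if estimate_tokens(text) > max_tokens]
```

Calling `find_oversized(texts)` before building the index should return an empty list here, since each chunk joins only a handful of short segments.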

In [79]:
docsearch = FAISS.from_texts(texts=texts, 
                             embedding=embeddings, 
                             metadatas=metadatas)

In [None]:
docsearch.save_local(f'{AUDIO_PATH}/vectorstore')

In [29]:
docsearch = FAISS.load_local(f'{AUDIO_PATH}/vectorstore', embeddings)

___
## (6) Check using Vector Similarity Search

In [30]:
query = "Which guest was invited to talk about the airline industry?"
docs = docsearch.similarity_search(query)
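
Conceptually, `similarity_search` embeds the query and returns the stored texts whose vectors lie closest to it. A toy brute-force version over hand-made 2-D vectors is sketched below; the vectors and helper names are illustrative only, and the use of L2 (Euclidean) distance is an assumption about how the index was built.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy "embeddings": the query is identical to vector 1 and close to vector 2
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
hits = brute_force_search([1.0, 0.0], doc_vecs, k=2)
```

FAISS performs the same kind of nearest-neighbor lookup, only far more efficiently and over high-dimensional embedding vectors.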

In [31]:
print(docs[0].page_content)

Shervin are excited to be talking today with Helen Li, Regional Director of Air Traffic Management and Airport Programs in China for the Boeing Company. Helen, thanks for taking the time to talk with us. Welcome. Thank you for having me. Let's get started. Helen, can you tell us about your current role at Boeing? I currently work at Boeing China in the Beijing office.
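
Each returned document also carries the metadata stored with it (`title`, `start`, `end`, `url`, `date`), which makes it possible to point back to the source episode. A small formatting sketch follows; `format_timestamp` and `describe_hit` are hypothetical helpers that operate on plain values rather than on any library object.

```python
def format_timestamp(seconds):
    """Format a number of seconds as M:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


def describe_hit(page_content, metadata, preview_chars=80):
    """Build a short citation: title, time span, URL, then a text preview."""
    span = f"{format_timestamp(metadata['start'])}-{format_timestamp(metadata['end'])}"
    preview = page_content[:preview_chars]
    return f"{metadata['title']} [{span}] {metadata['url']}\n{preview}"
```

For the search above, `print(describe_hit(docs[0].page_content, docs[0].metadata))` would show which episode and time span the top match came from.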
