# Chat Podcast

Author: Kenneth Leung

## 03B. FAISS Vectorstore
- Use FAISS to build vectorstores of transcripts
- https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/faiss.py

___
## (1) Install and Import Dependencies

In [1]:
import json
import os
import pandas as pd
import yaml
from dotenv import load_dotenv
from pathlib import Path
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

___
## (2) Configuration Settings

In [2]:
os.environ['OPENAI_API_KEY'] = 'your_key_here'

In [84]:
# Config settings
TRANSCRIPT_PATH = '../transcripts'
VECTORSTORE_PATH = '../vectorstore'

___
## (3) Processing of Transcripts

In [90]:
# View all transcribed files
transcripts = sorted([str(x) for x in Path(TRANSCRIPT_PATH).glob('*.jsonl')])
transcripts

["..\\transcripts\\A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen.jsonl",
 "..\\transcripts\\A Third Path to Talent Development - Delta's Michelle McCrackin.jsonl",
 "..\\transcripts\\AI in Aerospace - Boeing's Helen Lee.jsonl",
 "..\\transcripts\\AI in Your Living Room - Peloton's Sanjay Nichani.jsonl",
 "..\\transcripts\\Big Data in Agriculture - Land O'Lakes' Teddy Bekele.jsonl",
 "..\\transcripts\\Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott.jsonl",
 "..\\transcripts\\Digital First, Physical Second - Wayfair's Fiona Tan.jsonl",
 "..\\transcripts\\Extreme Innovation with AI - Stanley Black and Decker's Mark Maybury.jsonl",
 "..\\transcripts\\From Data to Wisdom - Novo Nordisk's Tonia Sideri.jsonl",
 "..\\transcripts\\From Journalism to Jeans - Levi Strauss' Katia Walsh.jsonl",
 "..\\transcripts\\Helping Doctors Make Better Decisions with Data - UC Berkley's Ziad Obermeyer.jsonl",
 "..\\transcripts\\Imagining Furniture (and the F

In [87]:
Path(transcripts[0]).name  # File name only, independent of the OS path separator

"A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen.jsonl"

In [69]:
lines = []

# Combine all JSONL files together
for transcript in transcripts:
    with open(transcript, "r", encoding="utf-8") as fp:
        for line in fp:
            line = json.loads(line) # Convert string dictionary to dict
            lines.append(line)

In [70]:
print(len(lines))

7682


In [71]:
lines[6]

{'title': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen",
 'date': '2023-03-28 07:00:00+00:00',
 'url': 'https://open.spotify.com/episode/1uTJp2EeePc29X4N1OsGoo',
 'id': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen-t23.240000000000002",
 'text': 'Welcome to Me, Myself and AI, a podcast on artificial intelligence and business.',
 'start': 23.240000000000002,
 'end': 28.34}

In [72]:
# Check the text of a few consecutive segments
for chunk in lines[5:8]:
    print(chunk['text'])

and AI.
Welcome to Me, Myself and AI, a podcast on artificial intelligence and business.
Each episode, we introduce you to someone innovating with AI.


___
## (4) Extend Segment Texts
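- Each transcript segment is only a phrase or sentence long, which is too short to index usefully
- To make the indexed documents more meaningful, we combine the texts of several consecutive segments
- Chunk size = number of consecutive segment texts combined into one document
- Chunk overlap = number of segment texts shared between consecutive documents (the code below uses this value as the step between chunk starts, which gives the intended overlap when chunk size is twice the chunk overlap, as with 6 and 3)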
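
The windowing idea can be illustrated on a toy list before applying it to real transcripts. This is a minimal sketch, not the notebook's own function: `window_indices` and `toy_segments` are hypothetical names, and the step is computed as `chunk_size - chunk_overlap`, whereas the notebook's function steps by `chunk_overlap` directly (the two give the same start positions for 6 and 3).

```python
def window_indices(n_items, chunk_size, chunk_overlap):
    """Return (start, stop) index pairs for overlapping windows (hypothetical helper)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far each window advances
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, stride)]


# Toy stand-ins for transcript segments
toy_segments = [f"s{i}" for i in range(10)]
windows = window_indices(len(toy_segments), chunk_size=6, chunk_overlap=3)

for start, stop in windows:
    # Join the segment texts in each window, as done for the real transcripts
    print(start, stop, " ".join(toy_segments[start:stop]))
```

With these settings each window shares three segments with the previous one, and the final windows are shorter because they run into the end of the list.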

In [131]:
def generate_new_segments(lines, chunk_size, chunk_overlap):
    new_segments = []

    # Step through the segments; chunk_overlap is used as the step size
    for i in range(0, len(lines), chunk_overlap):
        i_end = min(len(lines)-1, i + chunk_size)
        if lines[i]['title'] != lines[i_end]['title']:
            # Skip chunks that would cross into a different episode
            continue
        text_list = []
        for chunk in lines[i:i_end]:
            text_list.append(chunk['text'])
        text = ' '.join(text_list)
        new_segments.append({
            'start': lines[i]['start'],
            'end': lines[i_end]['end'],
            'title': lines[i]['title'],
            'text': text,
            'id': lines[i]['id'],
            'url': lines[i]['url'],
            'date': lines[i]['date']
        })
        
    return new_segments

In [132]:
new_segments = generate_new_segments(lines, 6, 3)

In [133]:
len(new_segments)

2517

In [75]:
new_segments[1636]

{'start': 171.04,
 'end': 191.76,
 'title': "Out of the Lab and Into a Product - Microsoft's Eric Boyd",
 'text': "components and delivering them. How do you know what to build? How do you tell people how to use them? How does this work? How does this infrastructure and ecosystem start to play out? Yeah. I mean, we're pretty privileged at Microsoft to have a whole bunch of different businesses that we've been in for a while.",
 'id': "Out of the Lab and Into a Product - Microsoft's Eric Boyd-t171.04",
 'url': 'https://open.spotify.com/episode/5XDcsbVuaQGkhs5m5Y4RfE',
 'date': 'Feb-23'}

___
## (5) Setup Vectorstore with FAISS

In [77]:
new_segments[0]

{'start': 0.0,
 'end': 28.34,
 'title': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen",
 'text': "Our guests often use Lego as an analogy for how organizations can build up solutions with data. But today, find out how Lego itself builds data components that connect as easily as it breaks. I'm Anders Putschbard-Kressensson from the Lego Group and you're listening to Me, Myself and AI.",
 'id': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen-t0.0",
 'url': 'https://open.spotify.com/episode/1uTJp2EeePc29X4N1OsGoo',
 'date': '2023-03-28 07:00:00+00:00'}

In [144]:
# Generate joined text and metadata for downstream vectorstore build
def get_texts_and_metadata(segments: list):
    texts = [elem['text'] for elem in segments]
    metadatas = [{
                "start": elem["start"],
                "end": elem["end"],
                "url": elem["url"],
                "date": elem["date"],
                "title": elem["title"],
                "id": elem["id"]
                } for elem in segments]
    
    return texts, metadatas

In [145]:
texts, metadatas = get_texts_and_metadata(new_segments)

##### Note that OpenAI embedding models have a limit of ~8,000 tokens per input text (8,191 for text-embedding-ada-002)
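
A quick sanity check against this limit can be sketched without calling the API. The helper names below are hypothetical, and the 4-characters-per-token ratio is only a rough rule of thumb for English text (an assumption, not a real tokenizer), so treat the result as an estimate.

```python
MAX_TOKENS = 8000        # approximate limit quoted above
CHARS_PER_TOKEN = 4      # assumption: rough average for English text


def estimate_tokens(text):
    """Crude token estimate from character count (not a real tokenizer)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def find_oversized(texts, max_tokens=MAX_TOKENS):
    """Return indices of texts whose estimated token count exceeds max_tokens."""
    return [i for i, t in enumerate(texts) if estimate_tokens(t) > max_tokens]


# One short transcript-like text and one artificially long text
sample_texts = ["Welcome to Me, Myself and AI.", "x" * 40000]
print(find_oversized(sample_texts))
```

With six short segments per chunk, the combined texts here should be far below the limit; the check matters more if the chunk size is increased substantially.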

In [116]:
embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])

In [79]:
docsearch = FAISS.from_texts(texts=texts, 
                             embedding=embeddings, 
                             metadatas=metadatas)

In [85]:
# Create vectorstore for ALL podcasts combined
docsearch.save_local(f'{VECTORSTORE_PATH}/all_podcasts')

In [86]:
docsearch = FAISS.load_local(f'{VECTORSTORE_PATH}/all_podcasts', embeddings)

___
## (6) Check using Vector Similarity Search

In [82]:
query = "Which guest was invited to talk about the airline industry?"
docs = docsearch.similarity_search(query)

In [83]:
print(docs[0].page_content)

And I think people will enjoy hearing about that. Thank you for taking the time to talk with us today. Always glad to talk with you and thanks for having me on. Thanks for tuning in. Please join us next time when Shervin and I meet Michelle McCracken, who's helping Delta Airlines teach frontline employees about analytics and AI.
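
Conceptually, a similarity search embeds the query and returns the stored texts whose vectors lie closest to it. The following is a minimal brute-force sketch of that idea, assuming Euclidean (L2) distance and toy 2-D vectors in place of real embeddings; `l2_distance` and `brute_force_search` are hypothetical helpers and do not reflect how FAISS is implemented internally.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors closest to query_vec (exhaustive scan)."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]


# Toy 2-D "embeddings"; real embeddings have many more dimensions
toy_vecs = [[0.0, 0.0], [1.0, 0.0], [0.8, 0.2], [5.0, 5.0]]
nearest = brute_force_search([1.0, 0.1], toy_vecs, k=2)
print(nearest)
```

The returned indices play the role of `docs` above: they would be used to look up the matching text and metadata.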


___
## (7) Single Vectorstore Builds
- Build one vectorstore for each podcast episode
- This allows the per-episode vectorstores to be mixed and matched (merged) later on
- Empty docs must be removed first to avoid `InvalidRequestError: [''] is not valid under any of the given schemas - 'input'`
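
The empty-doc filter can be tried on toy data first. This is a minimal sketch with a hypothetical helper, `drop_empty_segments`; it is slightly stricter than the notebook's filter because it also drops whitespace-only texts and segments without a `text` key, which is an extra precaution rather than something the error message requires.

```python
def drop_empty_segments(segments):
    """Keep only segments whose 'text' has non-whitespace content."""
    return [seg for seg in segments if seg.get("text", "").strip()]


# Toy segments: only the first one carries usable text
toy = [
    {"id": "a", "text": "Each episode, we introduce you to someone innovating with AI."},
    {"id": "b", "text": ""},
    {"id": "c", "text": "   "},
    {"id": "d"},
]
kept = drop_empty_segments(toy)
print([seg["id"] for seg in kept])
```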

In [166]:
def vectorstore_build(episode_path):
    embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
    # Use the file name without its extension (works on any OS and keeps titles that contain periods)
    episode_name = Path(episode_path).stem
    lines = []

    with open(episode_path, "r", encoding="utf-8") as fp:
        for line in fp:
            line = json.loads(line) # Convert string dictionary to dict
            lines.append(line)
            
    # Chunking and striding
    segments = generate_new_segments(lines, 6, 3)
    
    # Remove empty docs
    segments = [d for d in segments if d.get('text', '') != '']
    
    texts, metadatas = get_texts_and_metadata(segments)
    
    # Create vectorstore
    docsearch = FAISS.from_texts(texts=texts, 
                                 embedding=embeddings, 
                                 metadatas=metadatas)
    
    docsearch.save_local(f'{VECTORSTORE_PATH}/{episode_name}')

In [168]:
# Build vectorstore for each episode
for transcript in transcripts:
    vectorstore_build(transcript)

In [157]:
# def vectorstore_build(episode_path, batch_size):
#     embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
#     episode_name = episode_path.split('\\')[-1].split('.')[0]
#     lines = []

#     with open(episode_path, "r", encoding="utf-8") as fp:
#         for line in fp:
#             line = json.loads(line) # Convert string dictionary to dict
#             lines.append(line)
            
#     # Chunking and striding
#     segments = generate_new_segments(lines, 6, 3)
    
#     return segments

# #     # In case text goes too long, break up new segments into chunks
# #     print(episode_name, len(segments))

# #     # Create vectorstore
# #     for i in range(0, len(segments), batch_size):
# #         segment_batch = segments[i:i+batch_size]
# #         texts, metadatas = get_texts_and_metadata(segment_batch)
# #         if i == 0:
# #             print(f'Processing first batch for {episode_name}')
# #             docsearch = FAISS.from_texts(texts=texts, 
# #                                          embedding=embeddings, 
# #                                          metadatas=metadatas)
# # #             docsearch.save_local(f'{VECTORSTORE_PATH}/{episode_name}')
        
# #         else:
# #             print(f'Processing next batch for {episode_name}')
# # #             docsearch = FAISS.load_local(f'{VECTORSTORE_PATH}/{episode_name}', embeddings)
# #             docsearch.add_texts(texts, metadatas)
            
# #     print('Saving')
# #     docsearch.save_local(f'{VECTORSTORE_PATH}/{episode_name}')

___
## (8) Merge Vectorstores

In [169]:
db1 = FAISS.load_local(f"{VECTORSTORE_PATH}/A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen", embeddings)
db2 = FAISS.load_local(f"{VECTORSTORE_PATH}/A Third Path to Talent Development - Delta's Michelle McCrackin", embeddings)

In [170]:
db1.merge_from(db2)

In [171]:
db1.docstore._dict

{'10809e42-d822-4961-a237-2f1fee0fa844': Document(page_content="Our guests often use Lego as an analogy for how organizations can build up solutions with data. But today, find out how Lego itself builds data components that connect as easily as it breaks. I'm Anders Putschbard-Kressensson from the Lego Group and you're listening to Me, Myself and AI.", metadata={'start': 0.0, 'end': 28.34, 'url': 'https://open.spotify.com/episode/1uTJp2EeePc29X4N1OsGoo', 'date': '2023-03-28 07:00:00+00:00', 'title': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen", 'id': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen-t0.0"}),
 '25c862fd-5875-4273-aa14-357b7ac44996': Document(page_content="it breaks. I'm Anders Putschbard-Kressensson from the Lego Group and you're listening to Me, Myself and AI. Welcome to Me, Myself and AI, a podcast on artificial intelligence and business. Each episode, we introduce you to someone innovating with AI. I'm Sam Ransbotham, profe

In [112]:
db1.similarity_search('Who works at Lego?', k=2)

[Document(page_content="And I actually started out as a consultant seven years ago. I did mobile applications and websites and moved into project management of the clients that we built those products for. And then I think as so many other people in Denmark, we dream about working for Lego, right? We've all played with the bricks and we dream about working for them. It's not just Denmark.", metadata={'start': 858.7, 'end': 879.1, 'title': "A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen", 'url': 'https://open.spotify.com/episode/1uTJp2EeePc29X4N1OsGoo', 'date': '2023-03-28 07:00:00+00:00'}),
 Document(page_content="Welcome. Thanks for having me, Sam. First, tell us a little bit about what you do at Lego Group. I'm heading up the data engineering department within the Lego Group. We currently consist of three large global product teams within my area. Two of the teams focus on self-service, enabling the organization to make data-driven decisions. And the last one is 

___

In [173]:
episode_list = os.listdir(VECTORSTORE_PATH)
episode_list

["A One-Stop Data Shop - The Lego Group's Anders Butzbach Christensen",
 "A Third Path to Talent Development - Delta's Michelle McCrackin",
 "AI in Aerospace - Boeing's Helen Lee",
 "AI in Your Living Room - Peloton's Sanjay Nichani",
 'all_podcasts',
 "Big Data in Agriculture - Land O'Lakes' Teddy Bekele",
 "Choreographing Human-Machine Collaboration - Spotify's Sidney Madison Prescott",
 "Digital First, Physical Second - Wayfair's Fiona Tan",
 "Extreme Innovation with AI - Stanley Black and Decker's Mark Maybury",
 "From Data to Wisdom - Novo Nordisk's Tonia Sideri",
 "From Journalism to Jeans - Levi Strauss' Katia Walsh",
 "Helping Doctors Make Better Decisions with Data - UC Berkley's Ziad Obermeyer",
 "Imagining Furniture (and the Future) with AI - IKEA Retail's Barbara Martin Coppola",
 "Inventing the Beauty of the Future - L'Oreal's Stephane Lannuzel",
 "Investing in the Last Mile - PayPal's Khatereh Khodavirdi",
 "Keeping Humans in the (Feedback) Loop - Orangetheory Fitness' Am

In [179]:
def merge_vectorstores(episode_list, embeddings):
    merged_db = None
    for episode in episode_list:
        # Skip the vectorstore that already contains all episodes
        if episode == 'all_podcasts':
            continue
        db = FAISS.load_local(f"{VECTORSTORE_PATH}/{episode}", embeddings)
        if merged_db is None:
            # The first valid episode becomes the base vectorstore
            merged_db = db
        else:
            merged_db.merge_from(db)

    return merged_db

In [190]:
# Merging is very fast
db = merge_vectorstores(episode_list[2:6], embeddings)

In [191]:
len(db.docstore._dict)

297