<a href="https://colab.research.google.com/github/imusicmash/stanford_llm_python/blob/main/TECH16_LLM_Final_Project_PodcastExplorer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Podcast Explorer: A Better Way to Navigate Podcasts**

#Problem Statement:

My motivation for this project is personal. I recently became interested in real estate investing and, like many novices in a new field, relied on podcasts to learn. <br> <br> Enter Problem 1: There are endless real estate podcasts. Which one is best for me? After a few tries I settled on the *Bigger Pockets* podcast. <br> <br> Enter Problem 2: Bigger Pockets has over 900 podcast episodes and counting, covering various real estate investment topics and strategies. After a few episodes I identified the investment strategy that suited my personality and risk tolerance. I wanted to listen only to episodes covering this strategy, but the only way to find them was to keep working through all 900 in the hope that my topic would come up. <br> <br> Question: Can I simplify podcast exploration for someone like me, who wants to learn about a new field through podcasts but feels overwhelmed by the amount of content? Such a listener may have only a vague idea of where to start and wants to find good content easily, without spending months on countless episodes, most of which are of no interest. Also, how can I efficiently discover new podcasts beyond the one or two I started with? Surely Bigger Pockets is not the only good real estate podcast out there.

#Solution:

I would like to solve this with an app that has a conversational interface. A user asks about a topic and is presented with the top n podcast episodes that cover it, along with a summary of each episode on request. The app will have access to all podcast content, so the top results for a query will likely come from different podcasts, giving the user opportunities to discover new shows.
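The "top n episodes" idea can be sketched in a few lines. This is only an illustration, assuming the retrieval step returns scored transcript chunks labelled with their episode title; the function name `top_episodes` and the `(title, score)` input shape are hypothetical and not part of the notebook.

```python
from typing import Dict, Iterable, List, Tuple


def top_episodes(chunk_hits: Iterable[Tuple[str, float]], n: int = 3) -> List[Tuple[str, float]]:
    """Rank episodes by the best-scoring transcript chunk found in each one."""
    best: Dict[str, float] = {}
    for title, score in chunk_hits:
        # Keep only the highest score seen for each episode.
        if title not in best or score > best[title]:
            best[title] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical chunk-level hits: several chunks may come from the same episode.
hits = [("Episode A", 0.9), ("Episode B", 0.8), ("Episode A", 0.7), ("Episode C", 0.85)]
ranked_episodes = top_episodes(hits, n=2)
print(ranked_episodes)
```

Taking the best chunk per episode is one possible design choice; averaging or summing chunk scores would favor episodes that discuss the topic at length.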

#Scope:

Working on this project solo, I only completed a skeleton of the overall idea. I used Google's YouTube API to download 90 episodes of the Bigger Pockets podcast, transcribed them with OpenAI's Whisper Python package, and saved the transcriptions along with each YouTube URL to a Pinecone vector database. I then used LangChain to query the database. I did not have enough time to build a Streamlit app for the demo, but I plan to keep working on this over the coming weeks.

##Install packages, set API keys, etc.

In [None]:
!pip install -qU openai pod-gpt datasets git+https://github.com/openai/whisper.git
!pip install -q --upgrade pytube

In [None]:
import os
from google.colab import userdata

# openai api key
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
# pinecone api key
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
# pinecone environment
PINECONE_ENV = userdata.get("PINECONE_ENVIRONMENT")
# youtube api key
YOUTUBE_API_KEY = userdata.get('YOUTUBE_API_KEY')
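
The cell above only works inside Colab, because `google.colab.userdata` does not exist elsewhere. A minimal sketch of a fallback, assuming the same key names are also exported as environment variables when running outside Colab; the helper name `get_secret` is hypothetical.

```python
import os
from typing import Optional


def get_secret(name: str, default: Optional[str] = None) -> Optional[str]:
    """Read a secret from Colab user data if possible, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name, default)
    try:
        value = userdata.get(name)
    except Exception:
        # Colab raises when the secret is missing or access was not granted.
        value = None
    return value if value is not None else os.environ.get(name, default)


# Returns None when the key is configured in neither place.
OPENAI_API_KEY = get_secret("OPENAI_API_KEY")
```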


#Download YouTube videos, transcribe, and save to Pinecone DB

In [None]:
import pod_gpt

channel = pod_gpt.Channel(
    channel_id='UCVWDbXqQ8cupuVpotWNt2eg',  # Bigger Pockets YouTube Channel id
    api_key=YOUTUBE_API_KEY  # Google YouTube API
)

In [None]:
channel.get_videos_info(max_results=90)

In [None]:
import torch
import whisper

# prep whisper model
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

model = whisper.load_model("tiny.en").to(device)

In [None]:
channel.get_videos()

In [None]:
channel.transcribe_videos(model)


In [None]:
channel.save(filepath='BiggerPockets/transcripts.jsonl')

In [None]:
from datasets import load_dataset

data = load_dataset(
    'BiggerPockets/',
    split='train'
)
data

Dataset({
    features: ['video_id', 'channel_id', 'title', 'published', 'source', 'transcript'],
    num_rows: 90
})
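
For orientation, here is a sketch of what one line of `transcripts.jsonl` might look like, using the field names from the dataset summary above. The values are made up, and the exact layout written by `pod_gpt` is an assumption; the helpers `write_jsonl` and `read_jsonl` are hypothetical and use only the standard library.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record; only the field names come from the dataset summary.
record = {
    "video_id": "abc123",
    "channel_id": "UCVWDbXqQ8cupuVpotWNt2eg",
    "title": "Example episode",
    "published": "2024-01-15 00:00:00",
    "source": "https://youtu.be/abc123",
    "transcript": "Welcome to the show...",
}


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Round-trip the record through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "transcripts.jsonl"
    write_jsonl(jsonl_path, [record])
    loaded = read_jsonl(jsonl_path)

print(sorted(loaded[0].keys()))
```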

In [None]:
import pod_gpt

indexer = pod_gpt.Indexer(
    openai_api_key=OPENAI_API_KEY,
    pinecone_api_key=PINECONE_API_KEY,
    pinecone_environment=PINECONE_ENV,
    index_name="llm-project"
)

In [None]:
from tqdm.auto import tqdm

for row in tqdm(data):
    row['published'] = row['published'].strftime('%Y%m%d')
    indexer(pod_gpt.VideoRecord(**row))
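
Each transcript is far longer than a single embedding should cover, so an indexer has to split it before embedding. How `pod_gpt.Indexer` splits text is not shown in this notebook; the following is only a sketch of one common approach (fixed-size word windows with overlap), and `chunk_words` with its default sizes is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny demonstration with 10 "words", windows of 4, overlap of 1.
demo_chunks = chunk_words(" ".join(str(i) for i in range(10)), chunk_size=4, overlap=1)
print(demo_chunks)
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.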


#Use LangChain to query documents in the Pinecone DB

In [None]:
!pip install -qU pinecone-client==3.0.0 pinecone-datasets==0.7.0 langchain-pinecone==0.0.3 langchain-openai==0.0.7 langchain==0.1.9

In [None]:
from langchain_openai import OpenAIEmbeddings

# OpenAI embedding model used for query vectors (API keys are issued at platform.openai.com)
model_name = 'text-embedding-ada-002'

embeddings = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

In [None]:
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=PINECONE_API_KEY)

# connect to the existing index and check how many vectors it holds
index_name = "llm-project"
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.01262,
 'namespaces': {'': {'vector_count': 1262}},
 'total_vector_count': 1262}

In [None]:
from langchain_pinecone import PineconeVectorStore

text_field = "text"

vectorstore = PineconeVectorStore(
    index, embeddings, text_field
)
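
Conceptually, the vector store embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea in pure Python with toy 2-dimensional vectors. It assumes cosine similarity as the distance measure (the metric configured on the `llm-project` index is not shown here), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          stored: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[Tuple[float, str]]:
    """Return the k stored texts most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data standing in for embedded transcript chunks.
stored_chunks = [
    ("cash flow", [1.0, 0.0]),
    ("appreciation", [0.0, 1.0]),
    ("taxes", [1.0, 1.0]),
]
results = top_k([1.0, 0.1], stored_chunks, k=2)
print([text for _, text in results])
```

A real index uses approximate nearest-neighbor search over 1536-dimensional vectors rather than this exhaustive scan, but the ranking idea is the same.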

In [None]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)


In [None]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts import PromptTemplate

# The "stuff" sources chain fills {summaries} with the retrieved transcript
# chunks and {question} with the user's query, so the prompt needs both.
prompt_template = """You are a helpful assistant that helps the user answer questions from video transcriptions of podcasts. If you don't know the answer, just say "I don't know". Don't try to make up an answer. ALWAYS return SOURCES.
The SOURCES part should be a reference to the source of the document from which you got your answer.

Example of your response should be:

The answer is xyz
SOURCES:
1. xyz
2. xyz
3. xyz
4. xyz
5. xyz

Begin!
=========
{summaries}
=========
QUESTION: {question}
FINAL ANSWER:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["summaries", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # retrieve the 5 most similar transcript chunks for each query
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True
)
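
Because the chain is built with `return_source_documents=True`, the retrieved documents are available even when the parsed `sources` field comes back empty. A minimal sketch of pulling episode links from them, assuming each document's metadata stores the YouTube URL under a `source` key (an assumption based on the dataset's `source` column); `collect_sources` is a hypothetical helper, and the stand-in documents below are made up.

```python
from types import SimpleNamespace
from typing import Iterable, List


def collect_sources(source_documents: Iterable, key: str = "source") -> List[str]:
    """Return the unique metadata values under `key`, in retrieval order."""
    seen: List[str] = []
    for doc in source_documents:
        value = getattr(doc, "metadata", {}).get(key)
        if value and value not in seen:
            seen.append(value)
    return seen


# Stand-ins for retrieved documents; real ones would come from result["source_documents"].
fake_docs = [
    SimpleNamespace(page_content="chunk 1", metadata={"source": "https://youtu.be/abc123"}),
    SimpleNamespace(page_content="chunk 2", metadata={"source": "https://youtu.be/abc123"}),
    SimpleNamespace(page_content="chunk 3", metadata={"source": "https://youtu.be/def456"}),
    SimpleNamespace(page_content="chunk 4", metadata={}),
]
episode_links = collect_sources(fake_docs)
print(episode_links)
```

With a real result this would be called as `collect_sources(result["source_documents"])`, which gives the app a reliable list of episode links regardless of how the model formats its answer.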


In [None]:
query = "Is buying a house for apprecation over cash flow wise?"
qa_with_sources(query)

{'question': 'Is buying a house for apprecation over cash flow wise?',
 'answer': "The answer is: The upside to investing in properties with good cash flow is that you can become very efficient, get properties under market value, know how to add value, and scale relatively quickly without a lot of risk. The downside is that there's not a big upside in terms of appreciation, rents may not keep up with other areas, and you're exposed to more capital expenditure downside.\n",
 'sources': '',
 'source_documents': [Document(page_content="But for the time being, that sounds like a good option to me. Okay. That's something that you like then you should this, then you have your plan, like that's what you're looking at. You're not looking at the boulder. The upside to this is that you can become very efficient. If you know the market, if you know these properties, you can get them under market value. You know how to add value to them and you know that you can cash flow. You can, you can scale r

In [None]:
query = "Tell me about tax benefits of investing in real estate"
qa_with_sources(query)

{'question': 'Tell me about tax benefits of investing in real estate',
 'answer': 'The answer is the four wealth generators in real estate investing are cash flow, appreciation, loan paydown, and tax benefits.\n',
 'sources': '',
 'source_documents': [Document(page_content="But basically, this is just the money you're left with in your pocket at the end of every month. Now number two, appreciation. This is basically the simple truth that real estate tends to climb over time in value. Now, sure, things like 2008 do happen in prices, do drop sometimes a lot, but over time, prices tend to climb. As long as you can hold on to a property long enough, you should always see appreciation. And that's why cash flow, which we just talked about a second ago, is so vital, right? Because as long as I'm making cash flow, I can hold on to it as long as I need to, waiting for the property to climb in value. Now number three, the loan paydown. Now, normally, when you buy a piece of real estate, you get 