### Setup paths

In [1]:
import sys
import os
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Get the absolute path of the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)

### Step 0: (Optional) Download audio files from YouTube

In this project, I'm using podcasts from YouTube as examples. If you already have an mp3 file, this step is not needed and you can skip to Step 1.

Takes ~10 seconds to download the audio of a 3-hour video.

In [3]:
from data_pull_and_prep.audio_from_yt import download_audio

video_url = "https://www.youtube.com/watch?v=klRb0_BAX9g"  # Replace with your video URL
video_name = "peter_thiel"  # Replace with your video name
output_dir = project_root+"/data/audio_2/"  # Replace with your output directory

download_audio(video_url, video_name, output_dir)

Downloading audio...
Audio downloaded: /Users/rishikeshdhayarkar/rag-audio-indexing/data/audio_2/peter_thiel.mp3


### Step 1: Convert the mp3 file to text and generate a timestamp for each character

In [3]:
import data_pull_and_prep.utils as utils
import data_pull_and_prep.data_preparation as data_prep
import textwrap

Convert audio to text using OpenAI Whisper.

Transcribing a ~3 hour video with OpenAI Whisper takes ~6 minutes and costs ~$0.05.

In [5]:
audio_file_path = project_root+"/data/audio_2/peter_thiel.mp3"
transcription = data_prep.transcribe(audio_file_path)

The transcribed output is a list of segments. Each segment contains an id, a piece of transcribed text, and the start and end times of that text in the audio clip.

In [7]:
print(len(transcription))
print(f"id: {transcription[5][0]}")
print(f"text: {transcription[5][1]}")
print(f"start time: {transcription[5][2]}")
print(f"end time: {transcription[5][3]}")

2635
id: 5
text:  My pleasure. Thanks for having me.
start time: 14.0
end time: 16.0


For each segment (like the one printed in the cell above), calculate a timestamp for every character in its text by interpolation.

But why do we need character-level timestamps?
Character-level timestamps provide the flexibility to create text chunks of any size while still knowing where each chunk starts and ends in the audio.
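
The interpolation step can be sketched as follows. This is a minimal sketch that assumes a segment's duration is spread evenly over its characters. The function name `interpolate_char_timestamps` is hypothetical, and the project's own `data_prep.map_characters_to_timestamps` may differ in its details.

```python
def interpolate_char_timestamps(text, start, end):
    """Assign each character a timestamp by linear interpolation.

    Assumption: characters are evenly spaced between `start` and `end`.
    """
    if not text:
        return []
    step = (end - start) / len(text)
    return [(ch, start + i * step) for i, ch in enumerate(text)]


# Segment 5 from the transcription printed above: (id, text, start, end).
segment = (5, " My pleasure. Thanks for having me.", 14.0, 16.0)
char_timestamps = interpolate_char_timestamps(*segment[1:])
print(char_timestamps[:3])
```

Applying this to every segment and concatenating the results would give one flat list of `(character, timestamp)` pairs for the whole transcript.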

In [4]:
transcription_with_char_timestamps = utils.import_pkl_file(project_root+"/data/audio_2/transcription_with_char_timestamps_peter_thiel.pkl")

In [8]:
# transcription_with_char_timestamps = data_prep.map_characters_to_timestamps(transcription)

In [5]:
print(f"Total number of characters: {len(transcription_with_char_timestamps)}")
transcription_with_char_timestamps[:5]

Total number of characters: 188284


[(' ', 0.0),
 ('G', 0.12121212121212122),
 ('o', 0.24242424242424243),
 (' ', 0.36363636363636365),
 ('R', 0.48484848484848486)]

Save the character-level timestamps.

In [10]:
# utils.save_as_pickle_file(directory=project_root+"/data/audio_2/",
#                     filename="transcription_with_char_timestamps_peter_thiel.pkl",
#                     data=transcription_with_char_timestamps)

Create custom chunks using SentenceSplitter from LlamaIndex.

In [6]:
custom_chunking_obj = data_prep.CreateCustomTextChunks(transcription_with_char_timestamps)
text_chunks_with_timestamps = custom_chunking_obj.create_custom_text_chunks()

In [7]:
print(f"Number of text chunks: {len(text_chunks_with_timestamps)}")

Number of text chunks: 54
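
To illustrate why character-level timestamps make chunking flexible, here is a minimal sketch that cuts fixed-size chunks and reads each chunk's start and end time from its first and last character. It is a simplification: the notebook uses LlamaIndex's SentenceSplitter, not fixed-size windows, and the function name `chunk_with_timestamps`, the `(text, start, end)` tuple layout, and the toy input are all assumptions made for illustration.

```python
def chunk_with_timestamps(char_timestamps, chunk_size=1000):
    """Cut fixed-size chunks, keeping the first and last character time of each."""
    chunks = []
    for i in range(0, len(char_timestamps), chunk_size):
        window = char_timestamps[i:i + chunk_size]
        text = "".join(ch for ch, _ in window)
        chunks.append((text, window[0][1], window[-1][1]))
    return chunks


# Hypothetical toy input: one character every 0.5 seconds.
toy = [(ch, i * 0.5) for i, ch in enumerate("hello world, this is a test")]
toy_chunks = chunk_with_timestamps(toy, chunk_size=10)
for text, start, end in toy_chunks:
    print(f"{start:5.1f}s - {end:5.1f}s  {text!r}")
```

Because every character carries its own time, the chunk boundaries can be moved anywhere (for example to sentence ends) without losing the mapping back to the audio.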


In [8]:
print(textwrap.fill(str(text_chunks_with_timestamps[0]), width=160))

("Go Roman, fly gas, check it out! The Joe, Rogan, experience. Train my day, Joe Rogan, podcast my night all day! What's up, man? Good to see you. Glad to be on
the show. My pleasure. Thanks for having me. My pleasure. What's cracking? How you doing? Doing all right? We were just talking about how you're still trapped
in LA. I'm still trapped in LA. I know. Your friends are a lot of people out here. Have you thought about jettison? I talk about it all the time. But you know,
it's always talk is often a substitute for action. It's always does it mean to action or does it end up substituting for action? That's a good point. But I have
endless conversations about leaving. And moved from San Francisco to LA back in 2018. That felt about as big a move away as possible. And I keep the extreme
thing I keep saying. And you're going to have to keep my talk as a substitute for action. The extreme thing I keep saying is I can't decide whether to leave the
state or the country. Oh boy. And you kno

### Step 2: Create text nodes and add them to a vector store

In [9]:
import basic_rag.rag as rag
from dotenv import load_dotenv

dotenv_path = '.env'
load_dotenv(dotenv_path=dotenv_path)

pinecone_api_key = os.environ["PINECONE_API_KEY"]
openai_api_key = os.environ["OPENAI_API_KEY"]

In [10]:
custom_ingestion_obj = rag.CustomRAG(pinecone_api_key=pinecone_api_key,
              openai_api_key=openai_api_key,
              index_name="peter-thiel-08-27-via-class-trial1",
              text_chunks_with_timestamps=text_chunks_with_timestamps[:10]
              )

Takes ~10 seconds to upload all text nodes to the Pinecone vector store.

In [11]:
await custom_ingestion_obj.create_text_nodes_and_add_to_vector_store()

100%|██████████| 5/5 [00:01<00:00,  3.82it/s]
100%|██████████| 10/10 [00:05<00:00,  1.89it/s]
Upserted vectors: 100%|██████████| 10/10 [00:01<00:00,  9.89it/s]


### Step 3: Embedding retrieval from vector store

In [19]:
query_str = "What are peter thiel's thoughts on taxes collected for social security?"

In [20]:
custom_retriever_obj = rag.CustomRetriever(embed_model=custom_ingestion_obj.embed_model,
                                           vector_store=custom_ingestion_obj.vector_store)
query_result = custom_retriever_obj.retrieve(query=query_str)
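
Conceptually, embedding retrieval ranks stored vectors by their similarity to the query vector. The sketch below shows that idea with cosine similarity in plain Python. It is an illustration only: in this notebook the search runs inside the Pinecone vector store, and the names `cosine_similarity`, `top_k`, and the toy three-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (score, node_id) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), node_id)
              for node_id, vec in indexed]
    return sorted(scored, reverse=True)[:k]


# Hypothetical toy index; real embeddings have many more dimensions.
toy_index = [
    ("node-a", [1.0, 0.0, 0.0]),
    ("node-b", [0.7, 0.7, 0.0]),
    ("node-c", [0.0, 0.0, 1.0]),
]
hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
print([node_id for _, node_id in hits])
```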

### Step 4: Response Synthesis

In [22]:
import basic_rag.response_synthesizer as response_synthesizer

response_synthesizer_obj = response_synthesizer.HierarchicalSummarizer(llm=custom_ingestion_obj.llm)
response = response_synthesizer_obj.generate_response_hs(retrieved_nodes=query_result.nodes,
                                                         query_str=query_str)

In [23]:
print(textwrap.fill(response, 80))

Peter Thiel believes that the current social security tax system is regressive
and should be changed. He argues that the tax should not be capped at a certain
income level and should be more progressive. Thiel also believes that social
security should be means tested, with benefits only going to those who truly
need it.
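
One common way to synthesize a response from several retrieved nodes is to summarize them in groups and then summarize the summaries until a single answer remains. The sketch below shows that pattern with a stand-in for the LLM call. It is an assumption about the general technique, not a description of how `HierarchicalSummarizer` is implemented; `hierarchical_summarize`, `fake_summarize`, and `fan_in` are hypothetical names.

```python
def hierarchical_summarize(texts, summarize, fan_in=2):
    """Summarize groups of `fan_in` texts, level by level, until one remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not texts:
        return ""
    level = list(texts)
    while len(level) > 1:
        level = [summarize(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


def fake_summarize(group):
    """Stand-in for an LLM call: just shows which inputs were merged."""
    return "[" + " + ".join(group) + "]"


summary = hierarchical_summarize(["n1", "n2", "n3", "n4", "n5"], fake_summarize)
print(summary)
```

In a real pipeline, `summarize` would prompt the LLM with the query and the group of texts, so each level stays focused on the question being asked.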


### A wrapper that takes a query and returns an answer string (Step 3 + Step 4)

In [25]:
from basic_rag.utils import RetrieveAndAnswer

In [27]:
raa = RetrieveAndAnswer(ingestion_obj=custom_ingestion_obj)

query_str = "What are peter thiel's thoughts on taxes collected for social security?"
response = raa.answer(query_str)
print(textwrap.fill(response, 80))

Peter Thiel believes that the current social security tax system is regressive
and should be changed. He argues that the tax should not be capped at a certain
income level, as it currently is, and that it should be more progressive. Thiel
also believes that social security should be means tested, so that only those
who truly need it receive benefits. He suggests increasing the age for social
security and gradually dialing back government benefits in order to mitigate the
deficit issue. Thiel's views on taxes collected for social security align with
his libertarian beliefs of smaller government and reducing government spending
on social programs.
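
The wrapper simply chains the retrieval call from Step 3 with the synthesis call from Step 4. Below is a minimal sketch of that composition. It assumes a retriever with `retrieve(query=...)` and a synthesizer with `generate_response_hs(...)`, as used earlier in this notebook; the class `RetrieveAndAnswerSketch` and the two stub classes are hypothetical, and the real `RetrieveAndAnswer` takes an `ingestion_obj` instead.

```python
from types import SimpleNamespace


class RetrieveAndAnswerSketch:
    """Chain retrieval and response synthesis behind one `answer` call."""

    def __init__(self, retriever, synthesizer):
        self.retriever = retriever
        self.synthesizer = synthesizer

    def answer(self, query_str):
        query_result = self.retriever.retrieve(query=query_str)
        return self.synthesizer.generate_response_hs(
            retrieved_nodes=query_result.nodes, query_str=query_str)


class StubRetriever:
    """Stand-in retriever that always returns two fake nodes."""

    def retrieve(self, query):
        return SimpleNamespace(nodes=["node text 1", "node text 2"])


class StubSynthesizer:
    """Stand-in synthesizer that reports how many nodes it received."""

    def generate_response_hs(self, retrieved_nodes, query_str):
        return f"{len(retrieved_nodes)} nodes used for: {query_str}"


sketch = RetrieveAndAnswerSketch(StubRetriever(), StubSynthesizer())
answer = sketch.answer("example question")
print(answer)
```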


In [44]:
query_str = "Explain the oil wealth analogy/scenario that Peter Thiel talks about."
response = raa.answer(query_str)
print(textwrap.fill(response, 80))

Peter Thiel uses the oil wealth analogy to compare how California relies on big
tech companies like Google and Apple to generate significant wealth, similar to
how Saudi Arabia relies on oil wealth to sustain its economy and government.
Just as oil wealth sustains Saudi Arabia despite its flaws, the wealth generated
by the tech industry in California essentially pays for the state's
inefficiencies and distortions. Thiel suggests that as long as there is a source
of significant wealth, whether it be oil or tech, a state like California can
continue to function despite its perceived ridiculousness. Additionally, Thiel
also discusses how the rapid development and potential impact of artificial
intelligence (AI) in the 21st century can be compared to the oil boom in the
20th century, emphasizing the transformative potential of AI technology.


#### Don't know what questions to ask? Not a problem!

As part of text node creation, each text node is mapped to a set of LLM-generated relevant questions and topics/titles.

Any number of random questions and topics/titles can be generated for a given audio clip.
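
Drawing random questions and titles from the per-node metadata can be sketched like this. The metadata keys `"questions"` and `"titles"`, the function `sample_questions_and_titles`, and the toy metadata are assumptions for illustration; the actual `RandomQuestionAndTopics` class may store and sample them differently.

```python
import random


def sample_questions_and_titles(nodes_metadata, n_questions=3, n_titles=3, seed=None):
    """Pool questions and titles across nodes, then sample without replacement."""
    rng = random.Random(seed)
    questions = [q for meta in nodes_metadata for q in meta.get("questions", [])]
    titles = [t for meta in nodes_metadata for t in meta.get("titles", [])]
    return (rng.sample(questions, min(n_questions, len(questions))),
            rng.sample(titles, min(n_titles, len(titles))))


# Hypothetical metadata for two text nodes.
toy_metadata = [
    {"questions": ["question A1", "question A2"], "titles": ["title A"]},
    {"questions": ["question B1", "question B2"], "titles": ["title B"]},
]
questions, titles = sample_questions_and_titles(
    toy_metadata, n_questions=2, n_titles=5, seed=0)
print(questions, titles)
```

Calling the function again with a different seed (or none) gives a different selection, which matches the two different outputs shown in the cells below.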

In [65]:
from basic_rag.utils import RandomQuestionAndTopics

In [66]:
random_questions_and_topics_obj = RandomQuestionAndTopics(ingestion_obj=custom_ingestion_obj)

In [67]:
random_questions_and_topics_obj.print_questions_and_topics(
    random_questions_and_topics_obj.get_random_questions_and_titles())

╔═════════════════════════════════════════════════════════════════════════════╗
║ Questions                                                                   ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ 1. What are the challenges and potential solutions for addressing the budget  ║
║    deficit in California, including the options of raising taxes, cutting     ║
║    spending, or continuing to borrow money?                                   ║
║                                                                             ║
║ 2. What are the implications of California's tax policies, particularly the   ║
║    recent increase in taxes, on its population and revenue collection?        ║
║                                                                             ║
║ 3. How has the development of chat GPT in the early 2020s changed the         ║
║    definition and perception of AI, particularly in relation to passing the   ║
║    Turing test and achie

In [68]:
random_questions_and_topics_obj.print_questions_and_topics(
    random_questions_and_topics_obj.get_random_questions_and_titles())

╔═════════════════════════════════════════════════════════════════════════════╗
║ Questions                                                                   ║
╠═════════════════════════════════════════════════════════════════════════════╣
║ 1. What comparisons can be drawn between California and Saudi Arabia in terms ║
║    of their reliance on oil wealth and tech industry wealth, and how does     ║
║    this impact their respective economies and societies?                      ║
║                                                                             ║
║ 2. What challenges have individuals faced when trying to relocate from        ║
║    California to states like Florida and Texas due to the increase in housing ║
║    prices and interest rates?                                                 ║
║                                                                             ║
║ 3. How do discussions about potentially leaving California or the US          ║
║    altogether reflect br