# TRANSCRIPT.IA: KNOWLEDGE IS ALL YOU NEED

### Here's an overview of the project:

![Alt text](../images/Untitled-2024-10-12-1138.png)




#### Loading the environment variables we will use.
##### Make sure you have an OpenAI API key and a Pinecone API key.

In [1]:
import os
from dotenv import load_dotenv


load_dotenv(".env")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
# This is the YouTube video we're going to use.
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=sY8aFSY2zv4"
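
`os.getenv` returns `None` when a variable is not set, so a missing or misnamed key would only surface later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named `missing_env_vars` (it is not part of the original notebook) and a fake environment so the example runs anywhere:

```python
import os


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


# Fake environment for illustration only; in the notebook you would omit
# the second argument so the real environment is checked.
fake_env = {"OPENAI_API_KEY": "sk-example", "PINECONE_API_KEY": ""}
missing = missing_env_vars(["OPENAI_API_KEY", "PINECONE_API_KEY"], fake_env)
print(missing)  # ['PINECONE_API_KEY']
```

Calling it right after `load_dotenv` and stopping when the list is not empty makes configuration problems visible before any request is sent.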


## Setting up the model

#### Let's define the LLM model that we'll use as part of the workflow.

In [2]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-4o-mini")



##### Let's do a simple test

In [3]:
""" model.invoke("What do you want to do?") """

' model.invoke("What do you want to do?") '

##### Now we will parse the answer to extract it as a string using StrOutputParser

In [4]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
chain = model | parser
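
The `|` operator builds a pipeline in which the output of the left step becomes the input of the right step. The toy sketch below shows the idea with a hypothetical `Step` class; it is not LangChain's implementation, and `fake_model` and `fake_parser` are stand-ins that never call an API.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose left to right: run self first, then feed the result to other.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins for the model and the parser.
fake_model = Step(lambda text: {"content": text.upper()})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
print(toy_chain.invoke("hello"))  # HELLO
```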


## Starting with prompt templates
#### Here we are providing the model with some context and the question

In [5]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below.
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Miguel's friend is Fernando", question="Who is Miguel's friend?")


'Human: \nAnswer the question based on the context below.\nIf you can\'t answer the question, reply "I don\'t know".\n\nContext: Miguel\'s friend is Fernando\n\nQuestion: Who is Miguel\'s friend?\n'

In [6]:
chain = prompt | model | parser
chain.invoke({
    "context": "Miguel's friend is Fernando",
    "question": "Who is Miguel's friend?"
})

"Miguel's friend is Fernando."

![Alt text](../images/chain.png)

## Combining chains
##### We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language. Let's start by creating a new prompt template for the translation chain:

In [7]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

##### We can now create a new translation chain that combines the result from the first chain with the translation prompt. Here is what the new workflow looks like:

![Alt text](../images/translation.png)

In [8]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Miguel's friend is Fernando and he doesn't have more friends",
        "question": "How many friends does Fernando and Miguel have?",
        "language": "Spanish",
    })

# We can see the answer is not very accurate...

'Miguel tiene un amigo, que es Fernando, y Fernando no tiene más amigos. Por lo tanto, juntos tienen 2 amigos.'
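
In the translation chain, the dictionary is applied to a single input: every value receives the same input dictionary and the results are collected under the corresponding keys. `itemgetter("language")` simply picks the `language` entry. A minimal sketch of that mapping step in plain Python, where `fake_answer_chain` is a hypothetical stand-in for the first chain and no model is called:

```python
from operator import itemgetter

inputs = {
    "context": "Miguel's friend is Fernando",
    "question": "Who is Miguel's friend?",
    "language": "Spanish",
}

get_language = itemgetter("language")
print(get_language(inputs))  # Spanish


def fake_answer_chain(data):
    # Hypothetical stand-in for the first chain; it only echoes the question.
    return f"Answer to: {data['question']}"


# Toy version of the mapping step: each value is called with the same input.
mapping = {"answer": fake_answer_chain, "language": get_language}
translation_inputs = {key: func(inputs) for key, func in mapping.items()}
print(translation_inputs)
```

The resulting dictionary has exactly the two keys, `answer` and `language`, that the translation prompt expects.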

## Transcribing the YouTube Video


In [9]:
import tempfile
import whisper
from pytubefix import YouTube


# Let's do this only if we haven't created the transcription file yet.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO)
    audio = youtube.streams.filter(only_audio=True).first()

    # Let's load the base model. This is not the most accurate
    # model but it's fast.
    whisper_model = whisper.load_model("base")

    with tempfile.TemporaryDirectory() as tmpdir:
        file = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(file, fp16=False)["text"].strip()

        with open("transcription.txt", "w") as file:
            file.write(transcription)
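
The cell above only transcribes when `transcription.txt` does not exist yet, which avoids repeating a slow step. The same pattern can be sketched as a small reusable function. `load_or_create` and `fake_transcribe` are hypothetical names used for illustration; the fake function replaces the download and Whisper call so the example runs without either.

```python
import os
import tempfile


def load_or_create(path, create):
    """Return the text stored at `path`, calling `create()` only if the file is missing."""
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(create())
    with open(path, encoding="utf-8") as handle:
        return handle.read()


calls = []


def fake_transcribe():
    # Hypothetical stand-in for the download and transcription step.
    calls.append(1)
    return "battle not with monsters"


with tempfile.TemporaryDirectory() as tmpdir:
    cache_path = os.path.join(tmpdir, "transcription.txt")
    first = load_or_create(cache_path, fake_transcribe)
    second = load_or_create(cache_path, fake_transcribe)

print(first == second, len(calls))  # True 1
```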

##### Let's analyze the transcription:

In [10]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

'battle not with monsters, lest ye become a monster. And if you gaze into the abyss, the abyss gazes '

## Using the entire transcription as context
##### If we try to invoke the chain using the transcription as context, the model will return an error because the context is too long. Large language models support limited context sizes. The video we are using is too long for the model to handle, so we need to find a different solution.

In [11]:
try:
    chain.invoke(
        {
            "context": transcription,
            "question": "Give me some highlights about the interview",
        }
    )
except Exception as e:
    print(e)
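
Before sending a long text, it can help to estimate whether it is likely to fit. The sketch below uses a rough ratio of four characters per token, which is only an assumption for English text, not an exact count; the window size of 128,000 is a placeholder, so check the documentation of the model you use. Both function names are hypothetical.

```python
def rough_token_count(text, chars_per_token=4):
    """Very rough token estimate; the characters-per-token ratio is an assumption."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_in_context(text, context_window, reserved_for_answer=1000):
    """Check whether the estimated prompt size leaves room for the answer."""
    return rough_token_count(text) + reserved_for_answer <= context_window


short_text = "x" * 400        # about 100 estimated tokens
long_text = "x" * 1_000_000   # about 250,000 estimated tokens

# 128_000 is a placeholder window size, not a documented limit.
print(fits_in_context(short_text, context_window=128_000))  # True
print(fits_in_context(long_text, context_window=128_000))   # False
```

For an exact count you would use the tokenizer that matches the model instead of this heuristic.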

## Splitting the transcription
##### Since we can't use the entire transcription as the context for the model, a potential solution is to split the transcription into smaller chunks. We can then invoke the model using only the relevant chunks to answer a particular question:

![Alt text](../images/rag.png)

In [12]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'transcription.txt'}, page_content="battle not with monsters, lest ye become a monster. And if you gaze into the abyss, the abyss gazes also into you. Right. But I would say, bring it on. If you gaze into the abyss long enough, you see the light, not the darkness. Are you sure about that? I'm betting my life on it. Following is a conversation with Jordan Peterson, an influential psychologist, lecturer, podcast host, and author of Maps of Meaning, 12 rules for life, and beyond order. This is the Lex Friedman podcast to support it. Please check out our sponsors in the description. And now, dear friends, here's Jordan Peterson. Does the Yevsky wrote in the idiot spoken through the character of Prince Michigan that beauty will save the world? Soldiers and actually mentioned this in his Nobel Prize acceptance speech. What do you think the Yevsky meant by that? Was you right? Well, I guess it's the divine that saves the world. Let's say you could say that by def

##### There are different ways to split a document, but here we will split it into chunks of fixed size. Let's use 500 characters with an overlap of 20 characters, and display the first chunks.

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_splitter.split_documents(text_documents)[:5]

[Document(metadata={'source': 'transcription.txt'}, page_content="battle not with monsters, lest ye become a monster. And if you gaze into the abyss, the abyss gazes also into you. Right. But I would say, bring it on. If you gaze into the abyss long enough, you see the light, not the darkness. Are you sure about that? I'm betting my life on it. Following is a conversation with Jordan Peterson, an influential psychologist, lecturer, podcast host, and author of Maps of Meaning, 12 rules for life, and beyond order. This is the Lex Friedman podcast to support it."),
 Document(metadata={'source': 'transcription.txt'}, page_content="to support it. Please check out our sponsors in the description. And now, dear friends, here's Jordan Peterson. Does the Yevsky wrote in the idiot spoken through the character of Prince Michigan that beauty will save the world? Soldiers and actually mentioned this in his Nobel Prize acceptance speech. What do you think the Yevsky meant by that? Was you right? Wel

In [14]:
documents = text_splitter.split_documents(text_documents)
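
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-size splitter in plain Python. It is a sketch only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs, lines, and spaces, which this hypothetical `split_with_overlap` function does not attempt.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.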

![Alt text](../images/chunks.png)

#### Quick test

In [15]:
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("Who is Mary's friend?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])


Embedding length: 1536
[-0.006230046972632408, -0.02122783102095127, -0.005271578207612038, -0.018075820058584213, -0.0025553805753588676, 0.009121534414589405, -0.019786911085247993, -0.00039942897274158895, -0.017484012991189957, -0.02585935778915882]


#### Let's compare similarities

In [16]:
sentence1 = embeddings.embed_query("Mary's sister is susana")
sentence2 = embeddings.embed_query("Pedro's mother is a teacher")

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity


(0.8549904426038896, 0.7827007119231476)
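
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. A minimal sketch in plain Python, using made-up three-dimensional vectors instead of real embeddings:

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration, not real embeddings.
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]      # same direction as the query
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(round(cosine(query, close), 6))      # 1.0
print(round(cosine(query, unrelated), 6))  # 0.0
```

Scores near 1 indicate vectors pointing the same way, which is why the sentence about Mary scores higher than the one about Pedro in the cell above.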

## Setting up a Vector Store
##### We need an efficient way to store document chunks, their embeddings, and perform similarity searches at scale. To do this, we'll use a vector store. A vector store is a database of embeddings that specializes in fast similarity searches.

In [18]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "Mary's sister is Susana",
        "John and Tommy are brothers",
        "Patricia likes white cars",
        "Pedro's mother is a teacher",
        "Lucia drives an Audi",
        "Mary has two siblings",
    ],
    embedding=embeddings,
)



In [19]:
vectorstore1.similarity_search_with_score(query="Who is Mary's sister?", k=3)

[(Document(metadata={}, page_content="Mary's sister is Susana"),
  0.9173394801008801),
 (Document(metadata={}, page_content='Mary has two siblings'),
  0.9044728659809688),
 (Document(metadata={}, page_content='John and Tommy are brothers'),
  0.8013463844876156)]
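
Conceptually, an in-memory vector store keeps each text next to its embedding and ranks the entries by similarity to the embedded query. The sketch below illustrates that idea with a hypothetical `ToyVectorStore` and a word-count `toy_embed` function over a tiny fixed vocabulary. Both are assumptions for illustration: real embeddings capture meaning, and real stores use much faster search structures.

```python
import math

VOCAB = ["mary", "sister", "susana", "john", "tommy", "brothers", "lucia", "audi"]


def toy_embed(text):
    """Hypothetical embedding: counts of a few known words (no real semantics)."""
    words = text.lower().replace("'s", "").replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat empty vectors as unrelated


class ToyVectorStore:
    def __init__(self, texts, embed):
        self.embed = embed
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search_with_score(self, query, k=3):
        query_vector = self.embed(query)
        scored = [(text, cosine(query_vector, vec)) for text, vec in self.entries]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = ToyVectorStore(
    ["John and Tommy are brothers", "Mary has two siblings", "Mary's sister is Susana"],
    toy_embed,
)
results = store.similarity_search_with_score("Who is Mary's sister?", k=2)
print([text for text, _ in results])
# ["Mary's sister is Susana", 'Mary has two siblings']
```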

## Connecting the Vector Store to the chain
##### We can use the vector store to find the most relevant chunks from the transcription to send to the model. Here is how we can connect the vector store to the chain:

![Alt text](../images/chain4.png)

#### We need to set up a Retriever, which runs the similarity search

In [20]:
retriever1 = vectorstore1.as_retriever()
retriever1.invoke("Who's Mary's sister?")

[Document(metadata={}, page_content="Mary's sister is Susana"),
 Document(metadata={}, page_content='Mary has two siblings'),
 Document(metadata={}, page_content='John and Tommy are brothers'),
 Document(metadata={}, page_content="Pedro's mother is a teacher")]

##### Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

##### We can create a map with the two inputs by using the RunnableParallel and RunnablePassthrough classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [21]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("What color is Patricia's car?")

{'context': [Document(metadata={}, page_content='Patricia likes white cars'),
  Document(metadata={}, page_content='Lucia drives an Audi'),
  Document(metadata={}, page_content="Pedro's mother is a teacher"),
  Document(metadata={}, page_content="Mary's sister is Susana")],
 'question': "What color is Patricia's car?"}
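
The map step can be pictured as calling several functions with the same input and collecting the results by key. The toy sketch below shows that idea sequentially in plain Python. `run_parallel`, `fake_retriever`, and `passthrough` are hypothetical stand-ins, and the real `RunnableParallel` may run its branches concurrently.

```python
def run_parallel(steps, value):
    """Apply every callable in `steps` to the same input and collect results by key."""
    return {key: step(value) for key, step in steps.items()}


def fake_retriever(question):
    # Hypothetical stand-in for the retriever: returns a fixed list of chunks.
    return ["Patricia likes white cars", "Lucia drives an Audi"]


def passthrough(value):
    # Mirrors the idea of RunnablePassthrough: return the input unchanged.
    return value


result = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What color is Patricia's car?",
)
print(result)
```

The output is a dictionary with the keys `context` and `question`, which is the shape the prompt template needs.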

In [22]:
chain = setup | prompt | model | parser
chain.invoke("What color is Patricia's car?")

'Patricia likes white cars.'

## Loading the transcription into the Vector Store
##### We initialized the vector store with a few random strings. Let's create a new vector store using the chunks from the video transcription.



In [23]:
len(documents)

339

In [24]:
vectorstore2 = DocArrayInMemorySearch.from_documents(documents, embeddings)
setup2 = RunnableParallel(context=vectorstore2.as_retriever(), question=RunnablePassthrough())

In [25]:
chain_2 = setup2 | prompt | model | parser

In [26]:
chain_2.invoke("What jordan says about God?")

"Jordan mentions that God often says to people what they least want to hear, implying that God's guidance may not always align with what individuals desire or expect. He illustrates this concept through the story of Cain, where God tells Cain to look to his own devices, suggesting a theme of personal responsibility and introspection."

## Setting up Pinecone
#### So far we've used an in-memory vector store. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use [Pinecone](https://www.pinecone.io/).

#### The first step is to create a Pinecone account, set up an index, get an API key, and set it as an environment variable PINECONE_API_KEY.

#### Then, we can load the transcription documents into Pinecone:

In [27]:
from langchain_pinecone import PineconeVectorStore



index_name = "test"


pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)
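
A Pinecone index is created with a fixed vector dimension, and the embeddings stored in it must have that same length (the quick test earlier printed 1536 for this embedding model). The sketch below shows a simple pre-upload check in plain Python. `check_dimensions` is a hypothetical helper, and `INDEX_DIMENSION` is an assumption about how the `test` index was configured.

```python
def check_dimensions(vectors, index_dimension):
    """Return the positions of vectors whose length differs from the index dimension."""
    return [i for i, vector in enumerate(vectors) if len(vector) != index_dimension]


INDEX_DIMENSION = 1536  # assumption: the dimension chosen when the index was created

# Made-up vectors; the second one has the wrong length on purpose.
vectors = [[0.0] * 1536, [0.0] * 768, [0.0] * 1536]
print(check_dimensions(vectors, INDEX_DIMENSION))  # [1]
```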




In [28]:

pinecone.similarity_search("What Jordan says about Cain?")[:5]

[Document(id='61ae0bfd-a653-45e8-8c90-f39527e2bc38', metadata={'source': 'transcription.txt'}, page_content="in the miserable state you're in. So then Cain leaves his countenance falls as you might expect in Cain leaves. And he's so insensed by this because God has said, look, your problems are of your own making. And not only that, you invited the man. And not only that, you engaged in this creatively. And not only that, you're blaming it on me. And not only that, that's making you jealous of able who's your actual idol and goal. And Cain instead of changing kills able. Right. And then Cain's"),
 Document(id='b8a9f021-0d3f-4c20-9473-88ef2e61e35e', metadata={'source': 'transcription.txt'}, page_content="people. And that's a hell of a story because it's a story of Fratricidal murder that degenerates into genocide, flood and tyranny. So that's fun for the opening salvo of the story, let's say, enable in Cain, both make sacrifices. And for some reason, able sacrifices, please God. It's no

In [29]:
chain_pinecone = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()} | prompt | model | parser
)

chain_pinecone.invoke("What Jordan says about Cain?")

"Jordan discusses Cain's feelings of resentment and jealousy towards Abel due to the perceived favor Abel receives from God. He mentions that Cain's sacrifices are of lesser quality, which contributes to God rejecting them, leading Cain to feel burdened and frustrated while Abel thrives. Ultimately, Cain's envy and inability to take responsibility for his own situation result in him committing fratricide against Abel."