# Doc GPT practice

 - RAG (Retrieval-Augmented Generation): a model only knows the data it was trained on, so it cannot know private data. With RAG we retrieve the relevant documents and add them to the prompt as reference material, so a model trained on other data can still answer questions about ours. This increases the model's capability without retraining it. => Putting the retrieved documents straight into the prompt is the "stuff" approach. Depending on other variables, you can choose which type of RAG to use.

### Retrieval
 - Steps of data connection: Source -> Load -> Transform -> Embed -> Store -> Retrieve

 - Data loaders (Load): extract data from a source and bring it into LangChain
 - Splitters (Transform): split the loaded documents into smaller chunks

#### Load

##### unstructured file
 - supports txt, pdf, ppt, html, images, and more

In [27]:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import UnstructuredFileLoader
# There are many loaders (TextLoader, PyPDFLoader, ...). Instead of picking one per
# format, you can use UnstructuredFileLoader, which handles various file types.

In [28]:

loader = UnstructuredFileLoader("./files/chapter_one.txt")
loader.load()
# loader.load() returns a list of documents; here it holds a single document.
len(loader.load()) # output: 1
# One big document is hard to search: it is better to look for information in small chunks.

1

##### Split(transform)

In [45]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter

 - recursive character text splitter

In [38]:
splitter = RecursiveCharacterTextSplitter()
# Splitters try to keep the meaningful semantic structure.

In [39]:
# You can split the text into smaller chunks, but a small chunk size can cut sentences in half.
splitter = RecursiveCharacterTextSplitter(
    # maximum chunk size in characters: too small a value damages the meaning
    chunk_size=200,
    # repeat the end of the previous chunk at the start of the next one to preserve context
    chunk_overlap=50
)

In [None]:
docs = loader.load_and_split(text_splitter=splitter)
len(docs)
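To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators first); `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, which is what lets a sentence cut at a chunk boundary still appear whole in one of the two neighbours.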


- character text splitter
     - it has a separator property

In [46]:
char_splitter = CharacterTextSplitter(
    # split the text on this separator
    separator="\n",
    # maximum number of characters per chunk
    chunk_size=600,
    chunk_overlap=100,
    # by default the length of the text is measured with len(), i.e. in characters
    length_function=len,
    # note: an LLM measures text in tokens, not in characters
)

In [49]:
docs = loader.load_and_split(text_splitter=char_splitter)

##### Difference between character and token
- Tokens are not characters: one token can cover several characters.
- Models read text as token ids (numbers).
<br>
Let's count length in tokens instead of characters.

In [51]:
# align how the model counts with how we count
token_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n",
    chunk_size =600,
    chunk_overlap=100,
)
docs = loader.load_and_split(text_splitter=token_splitter)
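A toy illustration of why the splitter above counts tokens: a model sees integer ids, not characters, so the character count and the token count of the same text differ. The vocabulary and `toy_encode` below are made up for this example and are far simpler than a real tokenizer such as the one tiktoken provides.

```python
# Hypothetical toy vocabulary; real tokenizers use many thousands of sub-word pieces.
toy_vocab = {"what": 0, "did": 1, "eve": 2, "eat": 3, "?": 4}


def toy_encode(text):
    """Lower-case the text, split off the question mark, and map each piece to its id."""
    pieces = text.lower().replace("?", " ?").split()
    return [toy_vocab[piece] for piece in pieces]


text = "What did Eve eat?"
token_ids = toy_encode(text)
print(len(text))       # 17 characters
print(token_ids)       # [0, 1, 2, 3, 4] is what the model would read
print(len(token_ids))  # 5 tokens
```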

##### Embedding
- Convert human-readable text into numbers a computer can work with
- Vectors: create a vector for each document
- OpenAI embeddings have more than 1,000 dimensions (1,536 in the output below), but here we focus on a 3D vector.

For example<br>
3D: masculinity(m), femininity(f), royalty(r)<br>
The words<br>
'king' : 0.9m | 0.1f | 1.0r<br>
'man'  : 0.9m | 0.1f | 0.0r<br>
'queen': 0.1m | 0.9f | 1.0r<br>
- You can do arithmetic on word vectors to get other words<br>
king - man : <br>
0.0m | 0.0f | 1.0r ==> 'royal'<br>
'royal' : 0.0m | 0.0f | 1.0r<br>
'woman': 0.1m | 0.9f | 0.0r<br>
royal + woman ==> 0.1m | 0.9f | 1.0r : 'queen'<br>
=> This is what embeddings make possible<br><br>
- The same idea works for movies: place each movie in the vector space, and an algorithm can find movies close to the one you just watched.
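The king/queen arithmetic above can be checked with a few lines of plain Python. This is only a sketch on the hand-made 3D vectors from the example; `nearest` is a hypothetical helper that picks the closest word by cosine similarity, and real embeddings are learned rather than hand-assigned.

```python
import math

# Toy 3D embeddings from the example: (masculinity, femininity, royalty).
words = {
    "king":  (0.9, 0.1, 1.0),
    "man":   (0.9, 0.1, 0.0),
    "queen": (0.1, 0.9, 1.0),
    "woman": (0.1, 0.9, 0.0),
    "royal": (0.0, 0.0, 1.0),
}


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(vector, exclude=()):
    """Return the known word whose vector is most similar to the given vector."""
    candidates = {w: v for w, v in words.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))


royal_like = sub(words["king"], words["man"])   # king - man
queen_like = add(royal_like, words["woman"])    # (king - man) + woman

print(nearest(royal_like, exclude=("king", "man")))           # royal
print(nearest(queen_like, exclude=("king", "man", "woman")))  # queen
```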

In [75]:
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from dotenv import dotenv_values


# Load variables from .env file
env_vars = dotenv_values('.env')

# Access the environment variables
api_key = env_vars.get('OPENAI_API_KEY')

In [76]:
embedder = OpenAIEmbeddings(openai_api_key= api_key)
vector = embedder.embed_query("Hi")
len(vector) # 1536 dimensions for a word, 'hi'

1536

In [59]:
vector_doc = embedder.embed_documents([
    "hi",
    "How are you?",
    "I am fine",
    "and",
    "you?"
])
len(vector_doc) # 5
len(vector_doc[0]) # 1536 each vector has the same length

1536

When embedding, we save the resulting vectors so that we do not have to repeat the embedding process.<br>
<br>
A vector store saves the vectors and lets us search them.

In [77]:
# You can use FAISS instead of Chroma; if you do, remember to replace every Chroma reference with FAISS
from langchain.vectorstores.chroma import Chroma

In [86]:
from langchain.vectorstores.faiss import FAISS

In [87]:
vectorstore = FAISS.from_documents(docs, embedder)

In [None]:
results = vectorstore.similarity_search("what did Eve eat?")
# These chunks will be sent to the model, and the model will look for the answer in them: smaller chunks help reduce costs
results
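Conceptually, a vector store keeps (vector, text) pairs and ranks them by similarity to the query vector. The sketch below shows that idea in memory. `toy_embed` is a hypothetical word-count stand-in for a real embedding model, `ToyVectorStore` is not a LangChain class, the sample sentences are made up, and FAISS uses far more efficient index structures than a full sort.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self._rows = []  # list of (vector, text) pairs

    def add_texts(self, texts):
        for text in texts:
            self._rows.append((toy_embed(text), text))

    def similarity_search(self, query, k=2):
        """Embed the query and return the k stored texts most similar to it."""
        query_vector = toy_embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "Cain was a tiller of the ground.",
    "Eve ate the fruit of the tree.",
    "Abel was a keeper of sheep.",
])
top = store.similarity_search("what did Eve eat?", k=1)
print(top)  # ['Eve ate the fruit of the tree.']
```

Note that the word-count vectors only match on the shared word "eve"; a real embedding model would also recognise that "ate" and "eat" are related.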

##### Cache embeddings
- If we restart the notebook, the vectors are calculated again, which costs money. Caching the embeddings avoids that.

In [89]:
from langchain.storage import LocalFileStore

cache_dir = LocalFileStore("./.cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embedder, cache_dir
)

In [107]:
vectorstore = FAISS.from_documents(docs, cached_embeddings)
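The idea behind a cache-backed embedder can be sketched in plain Python: look the text up first, and only call the (paid) embedder on a miss. Both classes below are hypothetical stand-ins. The real `CacheBackedEmbeddings` persists vectors through a byte store such as `LocalFileStore`, whereas this sketch keeps them in an in-memory dict.

```python
class CountingEmbedder:
    """Hypothetical embedder that records how often it is really called."""

    def __init__(self):
        self.calls = 0

    def embed(self, text):
        self.calls += 1
        # Placeholder vector; a real model would return hundreds of dimensions.
        return [float(len(text)), float(text.count(" "))]


class CachedEmbedder:
    """Wrap an embedder and reuse the stored vector for any text seen before."""

    def __init__(self, inner):
        self.inner = inner
        self.cache = {}

    def embed(self, text):
        if text not in self.cache:
            self.cache[text] = self.inner.embed(text)
        return self.cache[text]


inner = CountingEmbedder()
cached = CachedEmbedder(inner)
first = cached.embed("what did Eve eat?")
second = cached.embed("what did Eve eat?")
print(inner.calls)  # 1: the second request was served from the cache
```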

##### Retrieve: search

In [None]:
results = vectorstore.similarity_search("what did Eve eat?")
# These chunks will be sent to the model, and the model will look for the answer in them: smaller chunks help reduce costs
results

# Document Chain
- Here we ask questions about our documents.
- There are different types of chain: stuff, refine, map reduce, and map re-rank.
- RetrievalQA is an off-the-shelf chain.
 - You can change the type of chain by modifying the 'chain_type' argument.

##### Stuff
- insert all the provided documents into the prompt

In [108]:
# Here, first we are using an off-the-shelf chain set up already.
from langchain.chains import RetrievalQA
llm = ChatOpenAI(api_key=api_key)
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # the default is 'stuff'; it is made explicit here on purpose
    # The RetrievalQA chain requires a 'retriever' argument.
    # retriever: an interface that returns documents for a given query.
    # Here we use the vector store as the retriever.
    retriever=vectorstore.as_retriever(),
)

In [None]:
chain.run("What was the context of Cain killing Abel?")
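What "stuff" means can be shown without any model call: all retrieved chunks are joined and placed into one prompt. A minimal sketch, assuming a hypothetical `build_stuff_prompt` helper and placeholder chunk texts; the actual prompt wording used by `RetrievalQA` is different.

```python
def build_stuff_prompt(documents, question):
    """Join every retrieved chunk into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n"
        "-----\n"
        f"{context}\n"
        "-----\n"
        f"Question: {question}"
    )


retrieved = ["chunk one text", "chunk two text"]  # placeholder chunks
prompt_text = build_stuff_prompt(retrieved, "What happened?")
print(prompt_text)
```

Because every chunk lands in a single prompt, this approach is the simplest one, but the total text has to fit into the model's context window.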

##### Refine
- Answer from the first document, then go through the remaining documents one at a time, updating the answer with each. The final answer is the most refined one.

In [111]:
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="refine",
    retriever=vectorstore.as_retriever(),
)


In [None]:
chain.run("What happened to the lineage of Adam and Eve?")
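The refine loop itself is small. The sketch below replaces the model with a hypothetical `fake_llm` that simply appends a document when it mentions the last word of the question, so only the control flow (carry the answer forward, update it per document) reflects the refine strategy. The sample sentences are made up.

```python
def fake_llm(existing_answer, document, question):
    """Hypothetical stand-in for a model call that refines the existing answer."""
    keyword = question.rstrip("?").split()[-1].lower()
    if keyword in document.lower():
        return (existing_answer + " " + document).strip()
    return existing_answer  # nothing relevant: keep the answer unchanged


def refine(documents, question):
    answer = ""
    for document in documents:
        # Each step sees the answer so far plus one new document.
        answer = fake_llm(answer, document, question)
    return answer


sample_docs = [
    "Seth was born to Adam.",
    "The weather was mild.",
    "Seth had a son named Enosh.",
]
refined = refine(sample_docs, "Who is Seth?")
print(refined)  # Seth was born to Adam. Seth had a son named Enosh.
```

With a real model this means one call per document, made one after another, so refine is slower and more expensive than stuff.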

##### Map reduce
- summarize each individual document, then send the summaries to the LLM to produce the final answer

In [105]:
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=vectorstore.as_retriever(),
)
chain.run("What did Adam say?")

'There is no mention of Adam speaking in the given portion of the document.'

##### Map re-rank
- create an answer from each document and give each answer a score.
- return the answer with the highest score.

In [None]:
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_rerank",
    retriever=vectorstore.as_retriever(),
)
chain.run("What did Adam say?")
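The map re-rank flow can be sketched with a stand-in scorer. `fake_answer_with_score` is hypothetical: it uses the number of shared words as the score, whereas the real chain asks the model to produce both the answer and the score. The sample sentences are made up.

```python
def fake_answer_with_score(document, question):
    """Hypothetical stand-in for a model call: returns (answer, score) for one document."""
    question_words = set(question.lower().rstrip("?").split())
    document_words = set(document.lower().rstrip(".").split())
    score = len(question_words & document_words)
    return document, score


def map_rerank(documents, question):
    # Map: answer and score each document independently.
    scored = [fake_answer_with_score(document, question) for document in documents]
    # Re-rank: keep only the answer with the highest score.
    best_answer, _best_score = max(scored, key=lambda pair: pair[1])
    return best_answer


sample_docs = [
    "Eve spoke with the serpent.",
    "Adam named the animals.",
    "Cain built a city.",
]
best = map_rerank(sample_docs, "What did Adam name?")
print(best)  # Adam named the animals.
```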

# LCEL (LangChain Expression Language) chain
- Implement every step ourselves, from the prompt to the output parser.
- Customize the chain
- A retriever takes a single string argument (the query) and returns a list of documents


In [114]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough

In [116]:
vectorstore = FAISS.from_documents(docs, cached_embeddings)
# first get the documents with retriever
retriever = vectorstore.as_retriever()
# second get a prompt and send lists of documents(output of retriever) to the prompts
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer questions using only the following context. If you don't know the answer, just say you don't know; don't make it up:\n{context}"),
    ("human", "{question}")
])
# third get llm chain to get the data from prompts
llm = ChatOpenAI(
    api_key=api_key,
    temperature=0.1,
    )
# create a chain in order
# chain = retriever | prompt | llm
# The question is sent to the retriever, which returns a list of documents relevant to it.
# Those documents fill {context}, and the question itself fills {question}.
# RunnablePassthrough: passes the input through unchanged to the prompt variable
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm
# here we use the invoke() method instead of run()
chain.invoke("What happened to Eve?")


AIMessage(content='According to the context, after Eve and Adam ate the fruit from the tree of knowledge of good and evil, they realized they were naked and sewed fig leaves together to cover themselves. They then heard the voice of the LORD God walking in the garden and hid themselves. The LORD God called out to Adam and asked where he was. Eve is not specifically mentioned after this event.')
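The data flow of the chain above (one input fans out to `context` and `question`, then flows on through `|`) can be imitated with a toy class. This is only a sketch of the idea, not LangChain's implementation; `Step`, `parallel`, and the toy components are all hypothetical names.

```python
class Step:
    """Toy runnable: wraps a function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b: run a first, then feed its output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(mapping):
    """Toy version of the dict step: feed the same input to every branch."""
    return Step(lambda value: {key: step.invoke(value) for key, step in mapping.items()})


passthrough = Step(lambda value: value)                   # returns the input unchanged
toy_retriever = Step(lambda query: ["doc about " + query])  # query -> list of documents
toy_prompt = Step(
    lambda inputs: f"Context: {inputs['context']}\nQuestion: {inputs['question']}"
)

toy_chain = parallel({"context": toy_retriever, "question": passthrough}) | toy_prompt
output = toy_chain.invoke("Eve")
print(output)
# Context: ['doc about Eve']
# Question: Eve
```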

### Map Reduce chain implementation
- Start from the list of retrieved documents
- for doc in list of documents | prompt | llm
- Each document is sent through a prompt to the LLM, which checks whether it contains any relevant information
- The responses for all documents are joined into one document
- That final document is sent through a prompt to the LLM => final answer.
- This approach suits the limited size of a prompt: only the relevant text is sent to the final LLM call.

In [121]:
# The RunnableLambda class allows calling a function inside any chain.
from langchain.schema.runnable import RunnableLambda

In [129]:
vectorstore = FAISS.from_documents(docs, cached_embeddings)
retriever = vectorstore.as_retriever()
# set up prompt for individual document
map_doc_prompt = ChatPromptTemplate.from_messages([
    ("system", 
     """
     Use the following portion of a long document to see if any of the text is relevant to answer the question. Return any relevant text verbatim.
     -----
     {context}
     
     """),
    ("human", "{question}")
])
# set up a chain for individual document: return relevant information to the question
map_doc_chain = map_doc_prompt | llm

# function that runs map_doc_chain on each document
def map_docs(inputs):
    # The return value has to be a single string.
    # Note that the keys match the dict built in map_chain.
    documents = inputs['documents']
    question = inputs['question']
    # Explicit-loop version, kept for reference:
    # results = []
    # for doc in documents:
    #     # a Document exposes its text through the page_content attribute
    #     result = map_doc_chain.invoke({
    #         "context": doc.page_content,
    #         "question": question
    #     }).content
    #     results.append(result)
    # return "\n\n".join(results)
    # More compact version: the results are joined into a single string.
    return "\n\n".join(
        map_doc_chain.invoke({
            "context": doc.page_content,
            "question": question
        }).content
        for doc in documents
    )

# set up the map chain: the retrieved documents are processed by map_docs, which calls map_doc_chain once per document

# the dict built in the first step is passed to map_docs through RunnableLambda => after all iterations, a single string is returned.
map_chain = {"documents": retriever, "question": RunnablePassthrough()} | RunnableLambda(map_docs)

# final prompt is used after all the individual prompts were processed.
final_prompt = ChatPromptTemplate.from_messages([
    ("system",
     """
     Given the following extracted parts of a long document and a question, create a final answer.
     If you don't know the answer, just say that you don't know. Don't try to make up an answer.
     -----
     {context}
     """),
    ("human", "{question}")
])
# the context for the final chain is the combined responses from map_chain
# map_chain returns a string, which is sent to the final prompt together with the question.
chain = {"context": map_chain, "question": RunnablePassthrough()} | final_prompt | llm

chain.invoke("What is Eve?")

AIMessage(content="Eve is the name of Adam's wife, who is referred to as the mother of all living. She is the woman that God created from one of Adam's ribs and brought to him in the garden of Eden.")