## Retrieval-Augmented Generation 
### Data Loaders and Splitters  

Split documents -> efficiently search for the parts that are needed and provide them to the prompt

In [2]:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    separators=["\n"],  # `separators` expects a list of strings
    chunk_size=600,     # measured in tiktoken tokens, not characters
    chunk_overlap=100,
)

loader = UnstructuredFileLoader("./files/geoge_owel.txt")


docs = loader.load_and_split(text_splitter=splitter)

# splitter.split_documents(docs)
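
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators and, with `from_tiktoken_encoder`, measures length in tokens). `split_with_overlap` is a hypothetical helper, and whitespace-separated words stand in for tokens.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy splitter: fixed-size windows of words; each window repeats the
    last `chunk_overlap` words of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "It was a bright cold day in April and the clocks were striking thirteen"
for chunk in split_with_overlap(sample, chunk_size=6, chunk_overlap=2):
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.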

tokenizer  
    -> token, token id

tiktoken package
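
A toy illustration of the text -> token -> token id mapping noted above. This is only a sketch: tiktoken uses byte-pair encoding over subword pieces with a fixed vocabulary, whereas the hypothetical `ToyTokenizer` below splits on whitespace and assigns ids in order of first appearance.

```python
class ToyTokenizer:
    """Maps whitespace-separated words to integer ids, growing the vocabulary as new words appear."""

    def __init__(self) -> None:
        self.token_to_id: dict[str, int] = {}
        self.id_to_token: list[str] = []

    def encode(self, text: str) -> list[int]:
        ids = []
        for token in text.lower().split():
            if token not in self.token_to_id:
                # First time we see this token: give it the next free id.
                self.token_to_id[token] = len(self.id_to_token)
                self.id_to_token.append(token)
            ids.append(self.token_to_id[token])
        return ids

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)


toy_tokenizer = ToyTokenizer()
token_ids = toy_tokenizer.encode("war is peace freedom is slavery")
print(token_ids)                        # [0, 1, 2, 3, 1, 4]
print(toy_tokenizer.decode(token_ids))  # war is peace freedom is slavery
```

The repeated word "is" gets the same id both times, which is the property a model relies on: identical tokens always map to identical ids.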

vectorize

https://turbomaze.github.io/word2vecjson/

https://www.youtube.com/watch?v=2eWuYf-aZE4
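
The idea behind vectorizing can be sketched with hand-made vectors. The numbers and the three dimensions below are invented for illustration (real embeddings are learned and have hundreds or thousands of dimensions); only the cosine-similarity arithmetic is the general technique.

```python
import math

# Made-up 3-dimensional "embeddings". Dimensions, loosely: royalty, masculinity, femininity.
vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.8, 0.1],
    "apple":  [0.0, 0.2, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, exclude=()):
    """Return the vocabulary word whose vector is most similar to `query`."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine_similarity(query, vectors[word]))


# king - man + woman
query = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(query, exclude={"king", "man", "woman"}))  # queen
```

A vector store does the same kind of nearest-neighbour lookup, with the question's embedding as the query and document chunks as the vocabulary.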


#### Retrieval QA

Document chain types: stuff, refine, map reduce, map re-rank
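
The four types differ in how the retrieved documents reach the model. The following is a plain-Python sketch of each strategy, not LangChain's implementation: `CountingLLM` and `fake_scored_llm` are hypothetical stand-ins for a chat model, and the prompt wording is invented.

```python
from typing import Callable

LLM = Callable[[str], str]


def stuff(llm: LLM, docs: list[str], question: str) -> str:
    # One call: every document is placed ("stuffed") into a single prompt.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def refine(llm: LLM, docs: list[str], question: str) -> str:
    # One call per document: each call sees the previous answer and may improve it.
    answer = ""
    for doc in docs:
        answer = llm(f"Existing answer: {answer}\nNew context: {doc}\nQuestion: {question}")
    return answer


def map_reduce(llm: LLM, docs: list[str], question: str) -> str:
    # Map: one call per document to extract what is relevant.
    extracts = [llm(f"Context: {doc}\nQuestion: {question}") for doc in docs]
    # Reduce: one final call over the combined extracts.
    return stuff(llm, extracts, question)


def map_rerank(scored_llm, docs: list[str], question: str) -> str:
    # One call per document returning (answer, score); the highest-scoring answer wins.
    candidates = [scored_llm(doc, question) for doc in docs]
    return max(candidates, key=lambda pair: pair[1])[0]


class CountingLLM:
    """Stand-in for a chat model: records prompts and returns a placeholder reply."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"<reply {len(self.prompts)}>"


def fake_scored_llm(doc: str, question: str):
    # Hypothetical scorer: pretends longer chunks answer with more confidence.
    return (f"answer from {doc!r}", len(doc))


docs = ["chunk one", "chunk two", "chunk three"]
for strategy in (stuff, refine, map_reduce):
    counter = CountingLLM()
    strategy(counter, docs, "Describe Victory Mansions")
    print(strategy.__name__, "->", len(counter.prompts), "model call(s)")
print(map_rerank(fake_scored_llm, docs, "Describe Victory Mansions"))
```

With three documents, stuff makes one model call, refine makes three sequential calls, and map reduce makes four (three map calls plus one reduce call).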

In [10]:
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from langchain.vectorstores import Chroma, FAISS
from langchain.storage import LocalFileStore
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0.1)
cache_dir = LocalFileStore("./.cache/")


embedder = OpenAIEmbeddings()

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embedder, cache_dir
)

vectorstore = FAISS.from_documents(docs, cached_embeddings)
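
The point of wrapping the embedder in a cache is to avoid paying for the same embedding twice. A minimal sketch of that idea, assuming an in-memory dict instead of `LocalFileStore`; `CachedEmbedder` and `fake_embed` are hypothetical names, and this is not how `CacheBackedEmbeddings` is implemented internally.

```python
import hashlib


class CachedEmbedder:
    """Wraps an embedding function and reuses stored vectors for texts seen before."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}   # key -> vector (a dict here; a file store would persist across runs)
        self.misses = 0   # how many times the underlying model was actually called

    def embed(self, text):
        # Key the cache on a hash of the text so keys have a fixed size.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]


def fake_embed(text):
    # Hypothetical stand-in for a paid embedding API call.
    return [len(text), text.count(" ")]


cached = CachedEmbedder(fake_embed)
cached.embed("Big Brother is watching you")
cached.embed("Big Brother is watching you")  # served from the cache
cached.embed("War is peace")
print(cached.misses)  # 2
```

Three `embed` calls trigger only two calls to the underlying function, because the repeated text is found in the store.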



## stuff LCEL Chain

In [12]:
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough

retriever = vectorstore.as_retriever()
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer questions using only the following context. If you don't know the answer just say you don't know, don't make it up.\n{context}"),
    ("human", "{question}"),
])
# The dict runs both entries on the same input: the retriever fetches context,
# RunnablePassthrough forwards the question unchanged.
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

chain.invoke("Describe Victory Mansions")

AIMessage(content='Victory Mansions is a building where Winston Smith resides. It is described as having glass doors at the entrance, which allow gritty dust to enter along with people. The hallway of Victory Mansions has a smell of boiled cabbage and old rag mats. There is a large colored poster on one end of the hallway, depicting the face of a man with a black mustache. The building has seven floors, and the lift is often not working due to the electricity being cut off during daylight hours. The poster with the enormous face, with the caption "BIG BROTHER IS WATCHING YOU," is present on each landing opposite the lift-shaft.')
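
The data flow of the stuff chain can be mimicked with plain functions. This is a sketch of the idea only, with no LangChain involved: `parallel`, `pipe`, `fake_retriever`, and `fake_prompt` are hypothetical helpers, and the retrieved chunks are placeholders.

```python
from functools import reduce


def parallel(steps):
    """Return a function that feeds one input to every step and collects the results by key."""
    return lambda value: {key: step(value) for key, step in steps.items()}


def pipe(*steps):
    """Compose steps left to right, similar in spirit to `a | b | c`."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def fake_retriever(question):
    # Hypothetical stand-in for vectorstore.as_retriever(): returns matching chunks.
    return ["chunk about the hallway", "chunk about the lift"]


def passthrough(value):
    return value


def fake_prompt(inputs):
    context = "\n".join(inputs["context"])
    return f"Context:\n{context}\n\nQuestion: {inputs['question']}"


toy_chain = pipe(
    parallel({"context": fake_retriever, "question": passthrough}),
    fake_prompt,
)
print(toy_chain("Describe Victory Mansions"))
```

The single input string is used twice: once as the search query and once, unchanged, as the question in the final prompt.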

## map reduce LCEL Chain  
Choosing the type -> depends on the size of the prompt and the number of docs

map reduce => suitable when there are a very large number of docs
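
One way to turn this rule of thumb into code is to compare an estimated token count against the model's context window. This is a rough sketch under stated assumptions: `choose_chain_type` is a hypothetical helper, the 4-characters-per-token estimate is a crude heuristic for English text, and the limits are example numbers rather than properties of any particular model.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def choose_chain_type(
    docs: list[str],
    question: str,
    context_limit: int,
    reserved_for_answer: int = 500,
) -> str:
    """Pick "stuff" when everything fits in one prompt, otherwise "map_reduce"."""
    budget = context_limit - reserved_for_answer
    needed = estimate_tokens(question) + sum(estimate_tokens(doc) for doc in docs)
    return "stuff" if needed <= budget else "map_reduce"


few_docs = ["x" * 2400] * 4     # 4 chunks of roughly 600 tokens each
many_docs = ["x" * 2400] * 40   # 40 chunks of roughly 600 tokens each
print(choose_chain_type(few_docs, "Describe Victory Mansions", context_limit=4096))   # stuff
print(choose_chain_type(many_docs, "Describe Victory Mansions", context_limit=4096))  # map_reduce
```

In practice a real tokenizer would give a better count than the character heuristic; the decision logic stays the same.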

In [14]:
from langchain.schema.runnable import RunnableLambda
# RunnableLambda: lets you call an ordinary function as a step inside a chain
map_doc_prompt = ChatPromptTemplate.from_messages([
    ("system", 
    """
    Use the following context of a long document to see if any of the text is relevant to answer the question. Return any relevant text verbatim.
    ---
    {context}
    """
    ),
    ("human","{question}"),
])

map_doc_chain = map_doc_prompt | llm

def map_docs(inputs):
    documents = inputs["documents"]
    question = inputs["question"]
    # Equivalent explicit loop:
    # results = []
    # for document in documents:
    #     result = map_doc_chain.invoke({
    #         "context": document.page_content,
    #         "question": question,
    #     }).content
    #     results.append(result)
    # return "\n\n".join(results)
    return "\n\n".join(
        map_doc_chain.invoke(
            {"context": doc.page_content, "question": question}
        ).content
        for doc in documents
    )

map_chain = {"documents": retriever, "question": RunnablePassthrough()} | RunnableLambda(map_docs)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", 
    """
    Given the following extracted parts of a long document and a question, create a final answer.
    If you don't know the answer, just say that you don't know. Don't try to make up an answer.
    ---
    {context}
    """
    ),
    ("human","{question}"),
])


chain = {"context": map_chain, "question": RunnablePassthrough()} | final_prompt | llm

chain.invoke("Describe Victory Mansions")

AIMessage(content='Victory Mansions is a building complex located in London, specifically in Airstrip One, which is the chief city of Oceania. The exact size and appearance of Victory Mansions are not mentioned in the given context. However, it is overshadowed by the Ministry of Truth and three other similar buildings. The building has glass doors that Winston Smith enters, and the hallway has a distinct smell of boiled cabbage and old rag mats. There is a large colored poster at one end of the hallway, depicting the face of a man in his forties with a black mustache. The flat that Winston lives in is on the seventh floor, and he usually has to climb the stairs since the lift is often not working. On each landing, there is a poster with the face of a man, and the caption beneath it reads "BIG BROTHER IS WATCHING YOU." Inside the flat, there is a telescreen, an oblong metal plaque on the right-hand wall that cannot be completely shut off. The living-room of the flat has a unique layout,