## RAG Architecture

A brief introduction and demo

## LangChain
LangChain is a framework for developing applications powered by language models. 

It is well suited to building RAG pipelines.

In [37]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../data/unstructured/pdfs/atenttion_is_all_you_need.pdf")
data = loader.load()
print(f"This document has {len(data)} pages")

This document has 15 pages


### Text splitters

We need to break the documents into smaller chunks. This helps both when indexing the data and when passing it to a model: large chunks are harder to search over and may not fit in the model's context window.

A chunk size between 500 and 1000 characters is recommended.

In [38]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    length_function=len,
    chunk_overlap=50
)

documents = text_splitter.split_documents(data)
print(f"Now we have {len(documents)} documents!")

Now we have 93 documents!
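To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraphs and line breaks); the function name `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Naive splitter: each chunk starts (chunk_size - chunk_overlap) chars after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical 1200-character sample text
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

The last 50 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.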


In [42]:
from langchain.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

# Instantiate OpenAI Embeddings
embedding_openai = OpenAIEmbeddings(model = "text-embedding-ada-002")

vector_store = Qdrant.from_documents(
    documents = documents,
    embedding = embedding_openai,
    location = ":memory:",
    collection_name = "transformers"
)

Generate Embeddings

In [3]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Instantiate OpenAI Embeddings
embedding_openai = OpenAIEmbeddings(model = "text-embedding-ada-002")

# Create Vector Store with Embeddings
vector_store = Chroma.from_documents(
    documents = documents,
    embedding = embedding_openai
)

Let's test it with a query

In [4]:
vector_store = Chroma(
    persist_directory = "../data/data_embeddings/attention-embeddings",
    embedding_function = embedding_openai
)

In [43]:
query = "What are transformers?"

docs = vector_store.similarity_search_with_score(query, k=5)
docs

[(Document(page_content='The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building', metadata={'source': '../data/unstructured/pdfs/atenttion_is_all_you_need.pdf', 'page': 1, '_id': 'b7b29e59a72d4551b2a00a58e1a9dab7', '_collection_name': 'transformers'}),
  0.8125454953207472),
 (Document(page_content='language modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention 

In [8]:
query = "What is attention?"

docs = vector_store.similarity_search_with_score(query, k=5)
docs

[(Document(page_content='reduced to a constant number of operations, albeit at the cost of reduced effective resolution due\nto averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\ndescribed in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\nof a single sequence in order to compute a representation of the sequence. Self-attention has been', metadata={'page': 1, 'source': '../data/unstructured/pdfs/atenttion_is_all_you_need.pdf'}),
  0.39198678731918335),
 (Document(page_content='reduced to a constant number of operations, albeit at the cost of reduced effective resolution due\nto averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as\ndescribed in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\nof a single sequence in order to compute a representation of the sequence. 
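Conceptually, `similarity_search_with_score` embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below shows that idea with cosine similarity on hand-written 3-dimensional vectors; the vectors, chunk labels and function names are invented for illustration and do not come from either vector store. Note that the two outputs above use different scales (0.81 with Qdrant, 0.39 with Chroma). As far as I know, Qdrant returns a similarity (higher is better) and Chroma a distance (lower is better), but check each store's documentation before comparing scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs with the highest similarity to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "vector store": (chunk text, embedding) pairs with made-up embeddings
toy_index = [
    ("chunk about attention", [0.9, 0.1, 0.0]),
    ("chunk about GPUs", [0.1, 0.9, 0.0]),
    ("chunk about BLEU scores", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```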

Let's create a retriever

In [44]:
retriever = vector_store.as_retriever(
    search_kwargs = {"k":4}
)
query = "What are transformers?"
retriever.get_relevant_documents(query)

In [5]:
from langchain_openai.chat_models import ChatOpenAI
from langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain

Finally, we can create a chain to ask questions based on our knowledge base (the vector store).

In [6]:
llm = ChatOpenAI(
    model_name = "gpt-3.5-turbo",
    temperature = 0.0,
    streaming=True
)

qa_chain_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever = retriever,
    return_source_documents=True

)
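The `"stuff"` chain type simply concatenates ("stuffs") all retrieved chunks into a single prompt. A rough sketch of that idea follows; the prompt wording and the names `stuff_prompt`, `retrieved` and `prompt_text` are illustrative assumptions, not LangChain's actual template.

```python
def stuff_prompt(question, chunks, separator="\n\n"):
    """Build one prompt that contains every retrieved chunk as context."""
    context = separator.join(chunks)
    return (
        "Use the following pieces of context to answer the question.\n"
        "----------------\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Hypothetical retrieved chunks
retrieved = [
    "The Transformer allows for significantly more parallelization.",
    "The Transformer relies entirely on self-attention.",
]
prompt_text = stuff_prompt("What are transformers?", retrieved)
print(prompt_text)
```

Because everything goes into one call, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep `k` and the chunk size small.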

Let's use our RAG pipeline 😎

In [8]:
qa_chain_with_sources.invoke({"question":"What are transformers?"})

{'question': 'What are transformers?',
 'answer': 'Transformers are a type of sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. They allow for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight GPUs.\n',
 'sources': '../data/unstructured/pdfs/atenttion_is_all_you_need.pdf',
 'source_documents': [Document(page_content='The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building', metadata={'page': 1, 'source': '../data/un

### Let's use Agents

In [12]:
import os
import pandas as pd
import glob
from typing import Dict, Any
import streamlit as st
from chromadb.api.models.Collection import Collection
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_openai.chat_models import ChatOpenAI
from langchain.vectorstores.chroma import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import VectorStoreRetriever
from langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.conversational_retrieval.base import BaseConversationalRetrievalChain
from langchain.prompts import SystemMessagePromptTemplate
from langchain.memory import ConversationBufferMemory

In [10]:
combine_docs_template_customer_service = """You are a Customer Service Agent from the {company} company and answer questions about the context provided. You can answer common questions
such as how are you and other things that people say in normal conversations, but when you receive questions about products, services or anything else you have
to answer based only on the context provided. You can introduce yourself as a Customer Service Agent from the {company} company, but only when someone asks who you are.
Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------

{context}

Question: {question}
Helpful Answer:
"""

template_customer_service = '''
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know.
Don't try to make up an answer. The context could be in any language, but you will always answer in the language of the user's question.
You have to act as a Customer Service Agent from the HuggingFace company and answer questions about the context provided. You can answer common questions
such as how are you and other things that people say in normal conversations, but when you receive questions about products, services or anything else you have
to answer based only on the context provided.

Context: {context}

Question: {question}
Answer:
'''


template_customer_service2 = '''
You have to act as a Customer Service Agent from the {company} company and answer questions about the context provided. You can answer common questions
such as how are you and other things that people say in normal conversations, but when you receive questions about products, services or anything else you have
to answer based only on the context provided.

'''

In [10]:
from langchain.callbacks.streaming_aiter import AsyncIteratorCallbackHandler
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler
from langchain.callbacks import StreamingStdOutCallbackHandler

In [45]:
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
# chat completion llm
llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    temperature=0.0,
    streaming=True,
    callbacks= [StreamingStdOutCallbackHandler()]
)
# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)
# retrieval qa chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)
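`ConversationBufferWindowMemory(k=5)` keeps only the most recent `k` exchanges, so the prompt does not grow without bound. The sketch below reproduces that behaviour with a `deque`; the class `WindowMemory` and its methods are hypothetical stand-ins, not LangChain's implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (user, assistant) exchanges."""

    def __init__(self, k=5):
        # A deque with maxlen drops the oldest exchange automatically
        self.turns = deque(maxlen=k)

    def save(self, user, assistant):
        self.turns.append((user, assistant))

    def history(self):
        """Flatten the stored exchanges into a list of (role, text) messages."""
        messages = []
        for user, assistant in self.turns:
            messages.append(("human", user))
            messages.append(("ai", assistant))
        return messages


window = WindowMemory(k=2)
for i in range(1, 4):
    window.save(f"question {i}", f"answer {i}")
print(window.history())
```

With `k=2`, the first exchange has already been dropped by the time the third is saved.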

In [46]:
from langchain.agents import Tool

tools = [
    Tool(
        name='Knowledge Base',
        func=qa.invoke,
        description=(
            'use this tool when answering general knowledge queries to get '
            'more information about the topic'
        )
    )
]

In [50]:
from langchain.tools.retriever import create_retriever_tool

tool = create_retriever_tool(
    retriever,
    "knowledge_base",
    'use this tool when answering general knowledge queries to get more information about the topic'
)
tools = [tool]

In [53]:
tool.invoke("Transformers")


'The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building\n\n7 Conclusion\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention.\nFor translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014\n\nlanguage modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attentio

In [64]:
from langchain.agents import initialize_agent
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory = memory,
    get_chat_history=lambda h : h,
    return_intermediate_steps = False
)

In [65]:
template_customer_service2 = """
Assistant is a large language model trained by {company}.

Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.

Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.

Assistant has to act as a Customer Service Agent from the {company} company and answer questions about the context provided. It can also answer common questions
such as how are you and other things that people say in normal conversations.

Overall, Assistant is a powerful system that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
"""

In [66]:
agent.agent.llm_chain.prompt.messages[0].prompt.template = template_customer_service2.replace("{company}", "HuggingFace")

In [67]:
agent("Quién eres?")


> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "Soy Assistant, un modelo de lenguaje creado por HuggingFace para ayudarte con una amplia variedad de tareas y preguntas. ¿En qué puedo ayudarte hoy?"
}
```

> Finished chain.


{'input': 'Quién eres?',
 'chat_history': [HumanMessage(content='Quién eres?'),
  AIMessage(content='Soy Assistant, un modelo de lenguaje creado por HuggingFace para ayudarte con una amplia variedad de tareas y preguntas. ¿En qué puedo ayudarte hoy?')],
 'output': 'Soy Assistant, un modelo de lenguaje creado por HuggingFace para ayudarte con una amplia variedad de tareas y preguntas. ¿En qué puedo ayudarte hoy?'}
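The verbose trace shows the agent replying with a fenced JSON blob containing `action` and `action_input`. A minimal sketch of how such a blob could be parsed is below; `parse_action` is a hypothetical helper written for this note, not the parser LangChain uses.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable

# Optional "json" tag after the opening fence, then a JSON object, then the closing fence
ACTION_PATTERN = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)


def parse_action(llm_output: str) -> dict:
    """Extract the JSON object from a fenced block, or parse the raw text if unfenced."""
    match = ACTION_PATTERN.search(llm_output)
    payload = match.group(1) if match else llm_output
    return json.loads(payload)


raw = (
    FENCE + "json\n"
    '{\n    "action": "knowledge_base",\n    "action_input": "Transformers"\n}\n'
    + FENCE
)
parsed = parse_action(raw)
print(parsed["action"], "->", parsed["action_input"])
```

When `action` is `Final Answer`, `action_input` is the text shown to the user; otherwise it names the tool to call and the input to pass to it.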

In [59]:
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler

agent.agent.llm_chain.llm.callbacks = [
    FinalStreamingStdOutCallbackHandler(
        answer_prefix_tokens=["Final", "Answer"]
    )
]

In [60]:
agent("What are transformers?")


> Entering new AgentExecutor chain...
```json
{
    "action": "knowledge_base",
    "action_input": "Transformers"
}
```
Observation: The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building

7 Conclusion
In this work, we presented the Transformer, the first sequence transduction model based entirely on
attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with
multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based
on recurrent or convolutional layers. On both WMT 2014 English-to-German and WM

{'input': 'What are transformers?',
 'chat_history': [HumanMessage(content='What are transformers?'),
  AIMessage(content='The Transformer is a sequence transduction model based entirely on attention, replacing recurrent layers with multi-headed self-attention. It can be trained faster than architectures based on recurrent or convolutional layers and has shown state-of-the-art performance in translation tasks.'),
  HumanMessage(content='What are transformers?'),
  AIMessage(content='The Transformer is a sequence transduction model based entirely on attention, replacing recurrent layers with multi-headed self-attention. It can be trained faster than architectures based on recurrent or convolutional layers and has shown state-of-the-art performance in translation tasks.')],
 'output': 'The Transformer is a sequence transduction model based entirely on attention, replacing recurrent layers with multi-headed self-attention. It can be trained faster than architectures based on recurrent or 

In [61]:
import sys

class CallbackHandler(StreamingStdOutCallbackHandler):
    """Print only the tokens that belong to the agent's final answer."""

    def __init__(self):
        super().__init__()
        self.content: str = ""
        self.final_answer: bool = False

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        self.content += token
        if "Final Answer" in self.content:
            # now we're in the final answer section, but don't print yet
            self.final_answer = True
            self.content = ""
        if self.final_answer:
            if '"action_input": "' in self.content:
                if token not in ["}"]:
                    sys.stdout.write(token)  # equal to `print(token, end="")`
                    sys.stdout.flush()

agent.agent.llm_chain.llm.callbacks = [CallbackHandler()]
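The filtering idea behind this handler can be tried without an LLM by feeding a list of tokens by hand. The sketch below is a simplified variant, not the handler above: it waits for `Final Answer`, then for the `"action_input": "` marker, and only then starts emitting. The class `FinalAnswerFilter` and the token split are assumptions; real tokenization will differ.

```python
class FinalAnswerFilter:
    """Pass through only the tokens that follow the final answer's action_input marker."""

    MARKER = '"action_input": "'

    def __init__(self):
        self.buffer = ""
        self.in_final_answer = False
        self.emitting = False

    def feed(self, token):
        """Return the token if it is part of the answer text, else None."""
        if self.emitting:
            # The closing quote and brace of the JSON blob are not answer text
            return None if token in ('"', "}") else token
        self.buffer += token
        if not self.in_final_answer and "Final Answer" in self.buffer:
            self.in_final_answer = True
            self.buffer = ""
        elif self.in_final_answer and self.MARKER in self.buffer:
            self.emitting = True
        return None


# Hypothetical token stream for a final answer
tokens = ['{', '"action"', ': "', 'Final', ' Answer', '",', ' "action_input"', ': "',
          'Trans', 'formers', ' use', ' attention', '"', '}']
answer_filter = FinalAnswerFilter()
emitted = [t for t in (answer_filter.feed(tok) for tok in tokens) if t is not None]
print("".join(emitted))
```

One limitation of this sketch: a token that is exactly `"` or `}` inside the answer itself would also be dropped.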

In [62]:
agent("What are transformers?")


> Entering new AgentExecutor chain...
```json
{
    "action": "knowledge_base",
    "action_input": "Transformers (machine learning model)"
}
```
Observation: The Transformer allows for significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU
[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building

tion models in various tasks, allowing modeling of dependencies without regard to their distance in
the input or output sequences [ 2,19]. In all but a few cases [ 27], however, such attention mechanisms
are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead
relying entirely on an attention mechanism to 

{'input': 'What are transformers?',
 'chat_history': [HumanMessage(content='What are transformers?'),
  AIMessage(content='The Transformer is a sequence transduction model based entirely on attention, replacing recurrent layers with multi-headed self-attention. It can be trained faster than architectures based on recurrent or convolutional layers and has shown state-of-the-art performance in translation tasks.'),
  HumanMessage(content='What are transformers?'),
  AIMessage(content='The Transformer is a sequence transduction model based entirely on attention, replacing recurrent layers with multi-headed self-attention. It can be trained faster than architectures based on recurrent or convolutional layers and has shown state-of-the-art performance in translation tasks.'),
  HumanMessage(content='What are transformers?'),
  AIMessage(content='The Transformer model is a sequence transduction model based on attention, replacing recurrent layers with multi-headed self-attention. It can be t

In [63]:
from langchain.callbacks.streaming_aiter import AsyncIteratorCallbackHandler
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler
from langchain.schema import LLMResult

class AsyncCallbackHandler(AsyncIteratorCallbackHandler):
    content: str = ""
    final_answer: bool = False
    
    def __init__(self) -> None:
        super().__init__()

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        self.content += token
        # if we passed the final answer, we put tokens in queue
        if self.final_answer:
            if '"action_input": "' in self.content:
                if token not in ['"', "}"]:
                    self.queue.put_nowait(token)
        elif "Final Answer" in self.content:
            self.final_answer = True
            self.content = ""
    
    async def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:
        if self.final_answer:
            self.content = ""
            self.final_answer = False
            self.done.set()
        else:
            self.content = ""

In [19]:
import asyncio

async def run_call(query: str, stream_it: AsyncCallbackHandler):
    # assign callback handler
    agent.agent.llm_chain.llm.callbacks = [stream_it]
    # now query
    await agent.acall(inputs={"input": query})

async def create_gen(query: str, stream_it: AsyncCallbackHandler):
    task = asyncio.create_task(run_call(query, stream_it))
    async for token in stream_it.aiter():
        yield token
    await task


stream_it = AsyncCallbackHandler()
gen = create_gen("Who are you", stream_it)

In [20]:
chunks = []
# `gen` is an async generator, so it must be consumed with `async for`
# (Jupyter allows top-level `async for` / `await` inside a cell).
async for chunk in gen:
    chunks.append(chunk)
    print("------")
    print(chunk)

In [36]:
agent.astream({"input": "Where is Fly Devs located?"})

<async_generator object AgentExecutor.astream at 0x7f371709c280>
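Calling `astream` only creates the async generator; nothing runs until it is iterated with `async for`. The sketch below shows the pattern with a stand-in generator, since the real agent needs an API key. `fake_token_stream` and `collect` are hypothetical names.

```python
import asyncio


async def fake_token_stream(text):
    """Stand-in for a streaming LLM: yields one word at a time."""
    for token in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield token


async def collect(stream):
    """Consume an async generator with `async for` and return the items as a list."""
    tokens = []
    async for token in stream:
        tokens.append(token)
    return tokens


# In a plain script, asyncio.run drives the coroutine.
# Inside Jupyter an event loop is already running, so use `await collect(...)` instead.
tokens = asyncio.run(collect(fake_token_stream("attention is all you need")))
print(tokens)
```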

In [17]:
agent.invoke({"input": "Where is Fly Devs located?"})


> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "I am Assistant, a large language model trained by Fly Devs to assist with a wide range of tasks and provide information on various topics."
}
```

> Finished chain.


{'input': 'who are you?',
 'chat_history': [HumanMessage(content='Wher is Fly Devs locatead?'),
  AIMessage(content='Fly Devs has locations in Argentina, Spain, the United States, and Australia.'),
  HumanMessage(content='who are you?'),
  AIMessage(content='I am Assistant, a large language model trained by Fly Devs to assist with a wide range of tasks and provide information on various topics.')],
 'output': 'I am Assistant, a large language model trained by Fly Devs to assist with a wide range of tasks and provide information on various topics.'}

In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("../data/unstructured/pdfs/atenttion_is_all_you_need.pdf")
data = loader.load()
print(f"This document has {len(data)} pages")


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    length_function=len,
    chunk_overlap=50
)

documents = text_splitter.split_documents(data)
print(f"Now we have {len(documents)} documents!")

from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Instantiate OpenAI Embeddings
embedding_openai = OpenAIEmbeddings(model = "text-embedding-ada-002")

# Create Vector Store with Embeddings
vector_store = Chroma.from_documents(
    documents = documents,
    embedding = embedding_openai
)
retriever = vector_store.as_retriever()

This document has 15 pages
Now we have 93 documents!


In [5]:
from langchain.tools.retriever import create_retriever_tool
retriever_tool=create_retriever_tool(retriever,"documents_search",
                      "Search for information about attention is all you need!")

retriever_tool.name

tools=[retriever_tool]

In [6]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)

In [8]:
from langchain import hub
# Get the prompt to use - you can modify this!
prompt = hub.pull("hwchase17/openai-functions-agent")
prompt.messages

[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='You are a helpful assistant')),
 MessagesPlaceholder(variable_name='chat_history', optional=True),
 HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='{input}')),
 MessagesPlaceholder(variable_name='agent_scratchpad')]

In [14]:
prompt.messages[0]

SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='You are a helpful assistant'))

In [9]:
from langchain.agents import create_openai_tools_agent
agent=create_openai_tools_agent(llm,tools,prompt)

In [10]:
from langchain.agents import AgentExecutor
agent_executor=AgentExecutor(agent=agent,tools=tools,verbose=True)
agent_executor

AgentExecutor(verbose=True, agent=RunnableMultiActionAgent(runnable=RunnableAssign(mapper={
  agent_scratchpad: RunnableLambda(lambda x: format_to_openai_tool_messages(x['intermediate_steps']))
})
| ChatPromptTemplate(input_variables=['agent_scratchpad', 'input'], input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]], 'agent_scratchpad': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, metadata={'lc_hub_owner': 'hwchase17', 'lc_hub_repo': 'openai-functions-agent', 'lc_hub_commit_hash': 'a165

In [11]:

agent_executor.invoke({"input":"What is attention is all you need?"})


> Entering new AgentExecutor chain...

Invoking: `documents_search` with `{'query': 'attention is all you need'}`


but its application should be just - this is what we are missing , in my opinion . <EOS>
<pad>Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:
Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5
and 6. Note that the attentions are very sharp for this word.
14

but its application should be just - this is what we are missing , in my opinion . <EOS>
<pad>Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top:
Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5
and 6. Note that the attentions are very sharp for this word.
14

but its application should be just - this is what we are missing , in my opinion . <EOS>
<pad>Figure 5: Many of th

{'input': 'What is attention is all you need?',
 'output': '"Attention is all you need" seems to be related to a concept in a document discussing attention heads and their application in anaphora resolution. The document mentions that the application of attention should be just what is missing. It also refers to attention heads exhibiting behavior related to the structure of sentences. If you need more detailed information or clarification, please let me know.'}