<a href="https://colab.research.google.com/github/mgorinova/language-modelling/blob/main/rag-with-langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Prelims

In [None]:
!pip install langchain langchainhub sentence_transformers

!pip install pypdf pdfminer.six
!pip install chromadb
!pip install openai

In [2]:
import os

Enable text wrapping in output cells so that the content is more readable.

In [3]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Overview
The goal of this Retrieval-Augmented Generation (RAG) task is to outline a prototype of a pipeline in which we can ask natural language questions about the information in one or more input documents. The output should help answer the user's question and cite the section of the document where the information is found.

This notebook uses Python and LangChain to implement such a prototype as follows:
1. Extraction: Text is extracted from the .pdf documents and split into chunks.
2. Embedding: The chunks are passed through an LLM to create embeddings. The embeddings are stored in a vector database (VB).
3. Query the VB: Questions from the user are embedded with the same language model. We retrieve the most similar chunks (in terms of embeddings) from the VB.
4. Prompt: The text of the chunks is incorporated into an LLM prompt as context. The prompt also asks the question provided by the user.
5. Assemble the response: The answer is shown together with the text and location of the chunks it cites.

## 1. Extraction

Use LangChain's document loaders to load the provided PDF files. The code assumes that all PDF files to be processed are in the local folder `data/`.

For this example, I included two papers: the 2017 NeurIPS paper "Attention is All You Need", and the 2021 ICCV "Swin Transformer" paper.

In [4]:
from langchain.document_loaders.pdf import PyPDFLoader

data_path = "data/"

pages = []
for file_name in os.listdir(data_path):
  file_path = os.path.join(data_path, file_name)
  loader = PyPDFLoader(file_path)
  pages.extend(loader.load())

Split the text into chunks, using appropriate separators, chunk size, and overlap size.

In [63]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100, add_start_index=True)

splits = text_splitter.split_documents(pages)

print(splits[2].metadata)
print(splits[2].page_content)

{'source': 'data/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf', 'page': 0, 'start_index': 1842}
and models are publicly available at https://github.
com/microsoft/Swin-Transformer .
1. Introduction
Modeling in computer vision has long been dominated
by convolutional neural networks (CNNs). Beginning with
AlexNet [35] and its revolutionary performance on the
ImageNet image classiﬁcation challenge, CNN architec-
*Equal contribution.†Interns at MSRA.‡Contact person.
Figure 1. (a) The proposed Swin Transformer builds hierarchical
feature maps by merging image patches (shown in gray) in deeper
layers and has linear computation complexity to input image size
due to computation of self-attention only within each local win-
dow (shown in red). It can thus serve as a general-purpose back-
bone for both image classiﬁcation and dense recognition tasks.
(b) In contrast, previous vision Transformers [19] produce fea-
ture maps of a single low resolu
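
To make the roles of `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not LangChain's implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split `text` into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for c in demo_chunks:
    print(c["start_index"], c["text"])
```

Each chunk starts with the last `chunk_overlap` characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.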

## 2. Embeddings

Embed the chunks using a language model. Here we use a HuggingFace `sentence_transformers` embedding model, namely the default [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

In [60]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

## 3. Query the Vector Database

Next, we take a question from the user and query the VB with it to retrieve the 5 chunks that are semantically most similar to the question.

I prepared 3 questions for this task: one relating to Document 1 (Attention is All You Need), one relating to Document 2 (Swin), and one relating to both.
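
Conceptually, a similarity search ranks the stored chunk embeddings by their closeness to the query embedding and returns the top k. The sketch below shows this with cosine similarity on toy 3-dimensional vectors in plain Python. The vectors, the names `cosine_similarity` and `top_k`, and the choice of cosine similarity are illustrative assumptions; the metric the vector store actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to `query`."""
    scores = [cosine_similarity(query, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy "chunk embeddings" (made-up numbers, not real model output).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.2, 0.0]
print(top_k(toy_query, toy_vectors, k=2))
```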

In [64]:
#question = "Can I use self-attention for natural language processing tasks?"
#question = "Can I use self-attention for computer vision tasks?"
question = "What is Transformer?"

retrieved_chunks = vectorstore.similarity_search(query=question, k=5)

[doc.metadata for doc in retrieved_chunks]

[{'page': 2,
  'source': 'data/NIPS-2017-attention-is-all-you-need-Paper.pdf',
  'start_index': 0},
 {'page': 4,
  'source': 'data/NIPS-2017-attention-is-all-you-need-Paper.pdf',
  'start_index': 503},
 {'page': 1,
  'source': 'data/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf',
  'start_index': 0},
 {'page': 1,
  'source': 'data/NIPS-2017-attention-is-all-you-need-Paper.pdf',
  'start_index': 3529},
 {'page': 1,
  'source': 'data/NIPS-2017-attention-is-all-you-need-Paper.pdf',
  'start_index': 3529}]
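
Note that the last two results above share the same source, page, and start index. One possible cause is that the same chunks were added to the vector store more than once, for example by re-running the embedding cell. Below is a minimal sketch for dropping such duplicates before building the prompt, keyed on the metadata fields shown above. The helper name `dedupe_by_metadata` is hypothetical, and it operates on plain metadata dictionaries rather than on LangChain `Document` objects.

```python
def dedupe_by_metadata(metadatas: list[dict]) -> list[int]:
    """Return indices of the first occurrence of each (source, page, start_index)."""
    seen = set()
    keep = []
    for i, m in enumerate(metadatas):
        key = (m["source"], m["page"], m["start_index"])
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep


# Metadata mirroring the shape of the output above (file name shortened).
example = [
    {"source": "a.pdf", "page": 2, "start_index": 0},
    {"source": "a.pdf", "page": 1, "start_index": 3529},
    {"source": "a.pdf", "page": 1, "start_index": 3529},
]
print(dedupe_by_metadata(example))
```

In the notebook, the returned indices could be used to filter `retrieved_chunks`, passing `[d.metadata for d in retrieved_chunks]` as the argument.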

In [66]:
print(retrieved_chunks[2].page_content)

In this paper, we seek to expand the applicability of
Transformer such that it can serve as a general-purpose
backbone for computer vision, as it does for NLP and
as CNNs do in vision. We observe that signiﬁcant chal-
lenges in transferring its high performance in the language
domain to the visual domain can be explained by differ-
ences between the two modalities. One of these differ-
ences involves scale. Unlike the word tokens that serve
as the basic elements of processing in language Trans-
formers, visual elements can vary substantially in scale, a


## 4. Prompt

We create a prompt for an LLM chatbot (in this case GPT-3.5 Turbo). The prompt incorporates the retrieved contexts and asks the chatbot to answer the question based on them. It also asks for a citation: the IDs of the contexts that were most influential in constructing the answer.

In [32]:
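from getpass import getpass
from langchain.chat_models import ChatOpenAI

# Never hard-code an API key in a notebook. Read it from the environment,
# or prompt for it at runtime if it is not set.
if "OPENAI_API_KEY" not in os.environ:
  os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)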

In [67]:
from langchain.schema.document import Document

def format_prompt(question: str, documents: list[Document]) -> str:
  """Function to create the LLM prompt based on the question and list of
  retrieved documents."""

  formatted_context = ""
  for i, d in enumerate(documents):
    formatted_context += (
      f"Context {i+1}, source file {d.metadata['source']}, "
      f"page {d.metadata['page']}, "
      f"location {d.metadata['start_index']}:\n{d.page_content}\n")

  return f"""You are an assistant for question-answering tasks on documents.
Use the following pieces of retrieved context from several existing documents to
answer the question. If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.
Do not ask me questions, simply answer this query. State the IDs of the 1 to 3
contexts that provided the most significant information for answering the query.
Do this at the end of your response, in brackets like so: [2,3]

Question: {question}
{formatted_context}
Answer:"""


In [68]:
llm_input = format_prompt(documents=retrieved_chunks, question=question)
response = llm.invoke(input=llm_input)

In [69]:
print(response.content)

Transformer is a model architecture that employs a residual connection and layer normalization. It consists of an encoder and a decoder, both composed of stacked self-attention and fully connected layers. The Transformer is used in various domains such as NLP and computer vision. [1, 2, 3]


Extract citation IDs from the response.

**Note**: this has not been adapted to handle cases where no citations are returned or where the IDs are invalid.

Perhaps a better approach would be to embed the response and cite the chunks of text most similar to it, instead of relying on the chatbot to tell us.
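
The embedding-based alternative mentioned above could look roughly like the following sketch. It embeds the response and every retrieved chunk, then cites the chunks with the highest cosine similarity to the response. The function name `cite_by_similarity`, the `embed` callable, and the keyword-counting `toy_embed` stand-in are all hypothetical and exist only to keep the example self-contained.

```python
import math
from typing import Callable


def cite_by_similarity(
    response_text: str,
    chunk_texts: list[str],
    embed: Callable[[str], list[float]],
    max_citations: int = 3,
) -> list[int]:
    """Return 1-based IDs of the chunks whose embeddings are closest to the response."""
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    response_vec = embed(response_text)
    scores = [cosine(response_vec, embed(t)) for t in chunk_texts]
    ranked = sorted(range(len(chunk_texts)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in ranked[:max_citations]]


# Toy stand-in for an embedding model: counts of a few keywords (illustration only).
def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in ("encoder", "decoder", "image")]


toy_chunks = ["the encoder and the decoder", "an image of a cat", "decoder only"]
print(cite_by_similarity("encoder and decoder stacks", toy_chunks, toy_embed, max_citations=2))
```

In this notebook, `embed` could plausibly be `embeddings.embed_query` and `chunk_texts` could be `[d.page_content for d in retrieved_chunks]`, so that the response and the chunks are embedded with the same model.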

In [52]:
import re

# Match a trailing citation list such as "[1, 2, 3]" and capture its contents.
citations = re.findall(r'\[((\d+, ?)*(\d+))\].? ?$', response.content)[0][0]
# Pull out each ID as an integer.
context_ids = [int(c) for c in re.findall(r'\d+', citations)]
context_ids

[1, 2, 3]
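
The extraction above raises an `IndexError` when the response contains no trailing citation list, and it does not check that the IDs refer to an existing context. A more defensive variant might look like the sketch below. The function name `parse_citation_ids` and the policy of silently dropping invalid or repeated IDs are assumptions, not part of the original notebook.

```python
import re


def parse_citation_ids(text: str, num_contexts: int) -> list[int]:
    """Extract IDs from a trailing "[1, 2]"-style list; return [] if none is found."""
    match = re.search(r"\[\s*(\d+(?:\s*,\s*\d+)*)\s*\]\W*$", text)
    if match is None:
        return []
    ids = [int(s) for s in re.findall(r"\d+", match.group(1))]
    # Keep only IDs that refer to an existing context, without duplicates.
    valid = []
    for i in ids:
        if 1 <= i <= num_contexts and i not in valid:
            valid.append(i)
    return valid


print(parse_citation_ids("Some answer. [1, 2, 3]", num_contexts=5))
print(parse_citation_ids("An answer without citations.", num_contexts=5))
```

In the notebook this could be called as `parse_citation_ids(response.content, len(retrieved_chunks))`.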

## 5. Assemble the response


In [70]:
context_texts = []
for context_id in context_ids:
  context_texts.append(retrieved_chunks[context_id-1].page_content)

In [71]:
print(f"Question: {question}\n")

print(f"Response: {response.content}\n")

for context_id, context_text in zip(context_ids, context_texts):
  context_metadata = retrieved_chunks[context_id-1].metadata
  print(
      f"[{context_id}] {context_metadata['source']}, "
      f"page {context_metadata['page']}, "
      f"start location {context_metadata['start_index']}")

  print(context_text)
  print()

Question: What is Transformer?

Response: Transformer is a model architecture that employs a residual connection and layer normalization. It consists of an encoder and a decoder, both composed of stacked self-attention and fully connected layers. The Transformer is used in various domains such as NLP and computer vision. [1, 2, 3]

[1] data/NIPS-2017-attention-is-all-you-need-Paper.pdf, page 2, start location 0
Figure 1: The Transformer - model architecture.
wise fully connected feed-forward network. We employ a residual connection [ 10] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm( x+ Sublayer( x)), where Sublayer(x)is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512 .
Decoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the tw