# 📚 Multi-Document QA System using RAG, FAISS & Cohere


In this project, we build a **Retrieval-Augmented Generation (RAG)** system that can intelligently answer questions based on content from multiple document formats — including PDF, PowerPoint, and Word.

We use:
- **HuggingFace MiniLM** for embeddings
- **FAISS** for vector-based semantic search
- **Cohere** LLM for generating accurate responses

This notebook walks through a working QA pipeline that answers from the content of your documents. The embedding model and the vector index are open-source, while the LLM is accessed through Cohere's hosted API.


### Install Required Libraries


In [30]:
# Installing Required Libraries
%pip install python-docx
%pip install python-pptx
%pip install PyPDF2
%pip install langchain
%pip install langchain_community
%pip install langchain_google_genai
%pip install langchain_text_splitters
%pip install sentence-transformers
%pip install faiss-cpu
%pip install cohere



### Import Required Python Modules

In [4]:
#  Imports
from docx import Document
from PyPDF2 import PdfReader
from pptx import Presentation
from langchain_community.llms import Cohere
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, HumanMessage
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder

### Load PDF, PPTX, and DOCX Files


In [5]:
# PdfReader accepts a file path directly, so no open file handle is left behind
pdf_file = '/content/NIPS-2017-attention-is-all-you-need-Paper.pdf'
ppt_file = Presentation('/content/hyperacidity.pptx')
doc_file = Document('/content/Synopsis.docx')

### Extract Text from All Documents


In [6]:
# Extract text from every page of the PDF
pdf_text = ""
pdf_reader = PdfReader(pdf_file)
for page in pdf_reader.pages:
    # Guard against pages where no text could be extracted
    pdf_text += page.extract_text() or ""

# Extract text from every slide shape that carries text
ppt_text = ""
for slide in ppt_file.slides:
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            ppt_text += shape.text + '\n'

# Extract text from every paragraph of the Word document
doc_text = ""
for paragraph in doc_file.paragraphs:
    doc_text += paragraph.text + '\n'

### Merge Extracted Text into One Corpus


In [7]:
# merging all the text

all_text = pdf_text + '\n' + ppt_text + '\n' + doc_text
len(all_text)

37856

### Split Text into Chunks for Embedding


In [8]:
# Split the text into overlapping chunks for embedding

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # overlap keeps context that would otherwise be cut at chunk boundaries
    length_function=len,
    # Separators are tried in order, so '\n' takes precedence over '\n\n' here
    separators=['\n', '\n\n', ' ', ''],
)

chunks = text_splitter.split_text(text = all_text)

In [9]:
len(chunks)

48
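
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators first); it only illustrates the idea of overlapping windows. The function name `sliding_window_chunks` and the sample string are made up for this illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.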

In [25]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
HF_TOKEN=userdata.get('HF_TOKEN')
COHERE_API_KEY=userdata.get('COHERE_API_KEY')
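
The cell above only works inside Google Colab. If you run the notebook elsewhere, one possible fallback is to read the same keys from environment variables. The helper name `get_secret` is hypothetical and this is only a sketch of that idea.

```python
import os


def get_secret(name, default=None):
    """Read a secret from Colab's store when available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: fall back to an environment variable of the same name
        return os.environ.get(name, default)
    return userdata.get(name)


cohere_key = get_secret("COHERE_API_KEY")
```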

### Create Embeddings & Store in FAISS


In [12]:
# Initializing embeddings model

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


In [13]:
# Indexing the data using FAISS
vectorstore = FAISS.from_texts(chunks, embedding = embeddings)

### Setup Retriever from Vector Store


In [14]:
# creating retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
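
The retriever returns the `k` chunks whose embeddings lie closest to the query embedding. As a rough illustration of that idea, here is a brute-force nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance; the function name `top_k_by_l2` and the 2-D toy vectors are made up for this sketch, whereas in the notebook the vectors come from the MiniLM model and the search is done by FAISS.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


toy_vectors = [
    [1.0, 0.0],   # chunk 0
    [0.9, 0.1],   # chunk 1
    [0.0, 1.0],   # chunk 2
    [-1.0, 0.0],  # chunk 3
]
toy_query = [1.0, 0.05]

nearest = top_k_by_l2(toy_query, toy_vectors, k=2)
print(nearest)  # [0, 1]
```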

In [20]:
retrieved_docs = retriever.invoke("What is transformers?") # simple query

In [21]:
len(retrieved_docs)

6

In [22]:
print(retrieved_docs[0].page_content)

2Figure 1: The Transformer - model architecture.
wise fully connected feed-forward network. We employ a residual connection [ 10] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm( x+ Sublayer( x)), where Sublayer(x)is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512 .
Decoder: The decoder is also composed of a stack of N= 6identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This


### Answer Questions using RAG with Cohere

In [23]:
prompt_template = """Answer the question as precisely as possible using the provided context. If the answer is
                not contained in the context, say "answer not available in context" \n\n
                Context: \n {context}\n
                Question: \n {question} \n
                Answer:"""

prompt = PromptTemplate.from_template(template=prompt_template)

In [24]:
# Join the documents returned by the FAISS retriever into a single context string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [28]:
# RAG Chain

def generate_answer(question):
    cohere_llm = Cohere(model="command", temperature=0.1, cohere_api_key = COHERE_API_KEY)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | cohere_llm
        | StrOutputParser()
    )

    return rag_chain.invoke(question)
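
The chain above passes the question along two paths: one through the retriever and `format_docs` to build the context, and one unchanged through `RunnablePassthrough()`. Both values fill the prompt, which is then sent to the LLM. The sketch below mirrors that data flow in plain Python. `FakeDoc`, `CORPUS`, `TEMPLATE`, `fake_retriever`, `fake_llm` and `run_chain` are hypothetical stand-ins, not LangChain or Cohere APIs.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    page_content: str


CORPUS = [
    FakeDoc("Hyperacidity is excess acid secretion in the stomach."),
    FakeDoc("The decoder is a stack of identical layers."),
]

TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def fake_retriever(question):
    """Stand-in for the FAISS retriever: keyword match instead of vector search."""
    words = [w.strip("?.,").lower() for w in question.split()]
    keywords = [w for w in words if len(w) > 4]
    return [doc for doc in CORPUS if any(kw in doc.page_content.lower() for kw in keywords)]


def format_docs(docs):
    """Join retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def fake_llm(prompt_text):
    """Stand-in for the Cohere call: returns the first line of the context."""
    return prompt_text.split("\n")[1]


def run_chain(question):
    # Step 1: build both prompt inputs from the same question
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Step 2: fill the prompt template
    prompt_text = TEMPLATE.format(**inputs)
    # Step 3: call the model and return plain text
    return fake_llm(prompt_text)


demo_answer = run_chain("What is hyperacidity?")
print(demo_answer)  # Hyperacidity is excess acid secretion in the stomach.
```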

In [31]:
ans = generate_answer("What is self attention?")
print(ans)

Ｏ) the positions in the
decoder that are to the left of the current position.
3.2.2 Multi-Head Attention

self-attention layer. The input to the multi-head attention layer consists of queries, keys and values
of dimension dk. The queries, keys and values are projected into dk, dimensions via linear
transformations, and are then split into k partitions of dimension dk/k. For each partition, we
compute a scaled dot-product attention layer as described above, yielding a result of dimension
dk. These k results are then concatenated along the depth dimension to produce the output of the
multi-head attention layer, of dimension dk.

The number of heads is a hyperparameter that can be set manually, and is usually set to be around
the square root of the dimensionality of the input to the attention layer.

The key benefit of multi-head attention is that it allows the model to jointly attend to information
at different representation granularities, or in different representational spaces. This i

In [32]:
ans = generate_answer("What is hyperacidity?")
print(ans)

 Hyperacidity is the excess secretion of hydrochloric acid in the stomach, which causes irritation, inflammation, and ulcers due to its contact with the mucosa. 


In [33]:
ans = generate_answer("What is my project about?")
print(ans)

 Your project aims to complete an analysis of Uber data to find key factors that can enhance the company's business. The analysis will use different columns of the dataset and try to find relationships between them. The project will use machine learning algorithms to predict prices and explore the effects of various factors like date, month, and weather. The goal is to provide insights and make recommendations to improve the business based on the data analysis. 


In [34]:
ans = generate_answer("How to deal with hyperacidity?")
print(ans)

 The provided text discusses hyperacidity in the context of Ayurvedic medicine. 

To deal with hyperacidity, or Amlapitta, in Ayurveda, one should follow a strict diet and general lifestyle guidelines as prevention is considered the best treatment. 

Dietary recommendations for people suffering from hyperacidity include drinking ample fluids, especially warm water, and consuming foods with cooling properties like coconut water. Bitter seasonal vegetables and fruits like gooseberry, dry grapes, black grapes, sweet lime, pomegranate, fig, and dry fig are also recommended. In addition, patients are advised to avoid spicy and sour foods, rice, curd, sour fruits, and bakery items, as well as fermented foods like bread, pickles, and maida. 

From a lifestyle perspective, patients are advised to practice yoga, pranayama, meditation, and exercise regularly. Getting enough rest and avoiding stress, anger, and overexposure to the sun are also listed as ways to help alleviate hyperacidity. 

Over

In [36]:
ans = generate_answer("what is decoder in transformer")
print(ans)

 The decoder in the Transformer is an auto-regressive component that is comprised of a stack of identical layers used to generate output symbols one at a time. Each layer includes three sub-layers, a multi-head self-attention mechanism, a simple, position-wise fully connected feed-forward network, and an additional sub-layer in the decoder that performs multi-head attention over the output of the encoder stack. 
