<a href="https://colab.research.google.com/github/poushalisanyal/-PDF-Based-Semantic-Search-Engine/blob/main/RAG_IMPLEMENT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PDF-Based Q&A with LangChain and FAISS
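This notebook demonstrates how to load a PDF file, split its text into overlapping chunks, embed each chunk with a HuggingFace sentence-transformer model (`all-MiniLM-L6-v2`), store the embeddings in a FAISS vector store, and retrieve the chunks most similar to a natural-language query. The notebook stops at retrieval: the "answers" printed below are the retrieved chunks themselves, not text generated by a language model.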

In [3]:
!pip install langchain langchain_community openai faiss-cpu sentence-transformers pypdf --upgrade --quiet

In [4]:
# Third-party integrations live in langchain_community; the old langchain.* import paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [5]:
loader = PyPDFLoader("/content/Deep Learning notes.pdf")

In [6]:
docs = loader.load()

In [7]:
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=300)

In [8]:
chunks = splitter.split_documents(docs)
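
With `chunk_size=500` and `chunk_overlap=300`, neighbouring chunks may share up to 300 characters, so each chunk adds roughly 200 new characters. The sketch below illustrates that idea with a plain sliding window. It is a simplification and not LangChain's actual algorithm: `CharacterTextSplitter` first splits on a separator (blank lines by default) and then merges the pieces, so real chunks can be longer or shorter than `chunk_size`. The function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap (illustrative only)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small numbers keep the overlap visible: windows of 10 characters sharing 6.
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=6)
for piece in pieces:
    print(piece)
```

A large overlap such as 300 out of 500 reduces the chance that a sentence is cut in half at a chunk boundary, at the cost of storing and embedding much of the text more than once.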

In [9]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [10]:
vectors = FAISS.from_documents(chunks, embeddings)

In [11]:
retriever = vectors.as_retriever(search_kwargs={"k": 3})
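
Conceptually, a retriever with `k=3` embeds the query and returns the three stored chunks whose vectors lie closest to it. The sketch below shows that ranking step in plain Python. It assumes an exhaustive search with Euclidean (L2) distance, which is my understanding of the default index LangChain builds for FAISS; the three-dimensional vectors and the names `toy_index`, `toy_query`, and `top_k` are invented for illustration and are not real embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Hand-made toy vectors standing in for chunk embeddings.
toy_index = [
    ("gradient descent", [0.9, 0.1, 0.0]),
    ("biological neuron", [0.1, 0.9, 0.0]),
    ("reference books", [0.0, 0.1, 0.9]),
    ("stochastic gradient descent", [0.8, 0.2, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend this is the embedded query

hits = top_k(toy_query, toy_index, k=3)
for distance, text in hits:
    print(f"{distance:.3f}  {text}")
```

The retriever always returns `k` chunks, however far away they are, so a vague query can still surface loosely related text.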

In [12]:
query = "What is the main idea of this document?"
results = retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()

In [13]:
print(f"\n📌 Top results for: {query}")
for i, doc in enumerate(results, 1):
    print(f"\n--- Chunk {i} ---\n{doc.page_content}")


📌 Top results for: What is the main idea of this document?

--- Chunk 1 ---
12 
 
 
4.7.4. Applications of Computational Neuro Science: 
 
• Deep Learning, Artificial Intelligence and Machine Learning 
• Human psychology 
• Medical sciences 
• Mental models 
• Computational anatomy 
• Information theory 
 
                 
Reference Books: 
1. B. Yegnanarayana, “Artificial Neural Networks” Prentice Hall Publications.  
2. Simon Haykin, “Artificial Neural Networks”, Second Edition, Pearson Education. 
3. Laurene Fausett, “Fundamentals of Neural Networks, Architectures,  Algorithms and 
Applications”, Prentice Hall publications. 
4. Cosma Rohilla Shalizi, Advanced Data Analysis from an Elementary Point of View, 2015. 
5. Deng & Yu, Deep Learning: Methods and Applications, Now Publishers, 2013. 
6. Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016. 
7. Michael Nielsen, Neural Networks and Deep Learning, Determination Press, 2015. 
 
 
Note: For furt

In [14]:
query = "Tell me about the biological neuron in deep learning?"
results = retriever.invoke(query)

In [15]:
print("Answer")
print(results[0].page_content)

Answer
4 
 
Information flow in a neural cell 
The input/output and the propagation of information are shown below. 
1.3. Artificial neuron model 
An artificial neuron is a mathematical function conceived as a simple model of a real (biological) 
neuron. 
• The McCulloch-Pitts Neuron 
This is a simplified model of real neurons, known as a Threshold Logic Unit. 
• A set of input connections brings in activations from other neurons. 
• A processing unit sums the inputs, and then applies a non-linear activation function (i.e. 
squashing/transfer/threshold function). 
• An output line transmits the result to other neurons. 
1.3.1 Basic Elements of ANN: 
• A neuron consists of three basic components: weights, thresholds and a single activation 
function. An artificial neural network (ANN) model based on biological neural systems is shown 
in figure 2. 
 
                      
                            Figure 2: Basic Elements of Artificial Neural Network 
 
1.4 Different Learning Rule

In [16]:
query = "Tell me about Gradient Descent?"
results = retriever.invoke(query)

In [17]:
print("Answer")
print(results[0].page_content)

Answer
15 
 
1.9.1. Types of Gradient Descent: 
Typically, there are three types of Gradient Descent: 
1. Batch Gradient Descent 
2. Stochastic Gradient Descent 
3. Mini-batch Gradient Descent 
1.9.2. Stochastic Gradient Descent (SGD): 
The word ‘stochastic’ means a system or a process that is linked with a random probability. 
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole 
data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes the total 
number of samples from a dataset that is used for calculating the gradient for each iteration. In typical 
Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole 
dataset. Although using the whole dataset is really useful for reaching the minima in a less noisy 
and less random manner, a problem arises when the dataset gets big.  
Suppose, you have a million samples in your dataset, so if you use a typical Gradie