<a href="https://colab.research.google.com/github/klmahalakshmi0102/PDFChatbot/blob/master/pdfchatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installs the required libraries: PyPDF2 for reading PDFs, sentence-transformers for sentence embeddings, transformers for the question-answering model, and scikit-learn for nearest-neighbor search.

In [None]:
!pip install PyPDF2 sentence-transformers transformers scikit-learn



In [None]:
# Imports the necessary libraries and modules for the script.
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from sklearn.neighbors import NearestNeighbors
import numpy as np

Reads the content of the PDF file and extracts the text from each page.

In [None]:
pdf_path = '/content/Lumbini_resume_new.pdf'

In [None]:
pdfreader = PdfReader(pdf_path)
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        # Separate pages with a newline so words at page boundaries do not fuse.
        raw_text += content + '\n'

The split_text function splits a long text into fixed-size character chunks that overlap, so text cut at a chunk boundary also appears in the neighboring chunk and keeps its surrounding context. With the defaults, each chunk is up to 600 characters long and starts 500 characters after the previous one.

In [None]:
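def split_text(text, chunk_size=600, chunk_overlap=100):
    # The step between chunk starts must be positive, or range() fails or yields nothing.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Slice fixed-size character windows; the last chunk may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]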
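
To make the overlap concrete, the sketch below re-declares the same slicing logic under a hypothetical name, split_text_demo, and runs it on a ten-character string. The values chunk_size=4 and chunk_overlap=1 are toy values chosen for illustration, not the notebook's defaults.

```python
def split_text_demo(text, chunk_size, chunk_overlap):
    # Same sliding-window logic as split_text; the function name is hypothetical.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_chunks = split_text_demo("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Each chunk begins with the last character of the previous one. In this example the final chunk is a short tail whose content already appears in the chunk before it.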

Checks that text was extracted from the PDF, splits it into chunks, and encodes the chunks into embeddings using the pre-trained all-MiniLM-L6-v2 SentenceTransformer model.

In [None]:

# Stop early if the PDF yielded no text (for example, a scanned, image-only PDF);
# otherwise the later steps would fail on an empty list of chunks.
if not raw_text:
    raise ValueError("No text extracted from the PDF.")

texts = split_text(raw_text)

model = SentenceTransformer('all-MiniLM-L6-v2')
# One 384-dimensional embedding vector per chunk.
embeddings = model.encode(texts)

Fits a Nearest Neighbors model to the text embeddings to find the most relevant text chunks for a given query.



In [None]:
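# Cap n_neighbors at the number of chunks: kneighbors() raises an error
# when asked for more neighbors than there are fitted samples.
nn_model = NearestNeighbors(n_neighbors=min(5, len(texts)), metric='cosine')
nn_model.fit(embeddings)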
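
With metric='cosine', neighbors are ranked by cosine distance, which is 1 minus the cosine similarity of two vectors. The pure-Python sketch below illustrates that ranking on made-up two-dimensional vectors. The helper names cosine_distance and nearest_indices are hypothetical, and the brute-force loop is only an illustration of the idea, not a description of how scikit-learn implements the search.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for vectors pointing the same way, 1 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_indices(query, vectors, k):
    # Brute-force search: rank every stored vector by its distance to the query.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]

# Toy "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top2 = nearest_indices([2.0, 0.1], toy_vectors, k=2)
print(top2)  # [0, 2]
```

The query's length does not matter, only its direction: [2.0, 0.1] points almost the same way as [1.0, 0.0], so index 0 ranks first.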

Loads a pre-trained question-answering model using the Hugging Face transformers library.

In [None]:
qa_pipeline = pipeline('question-answering', model='deepset/roberta-base-squad2')


The run_query function embeds the query, retrieves the nearest text chunks with the Nearest Neighbors model, joins them into a single context string, and runs the question-answering pipeline on that context to extract the answer.

In [None]:
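def run_query(query, texts, embeddings, nn_model, qa_pipeline, max_answer_len=500):
    # Note: this relies on the global SentenceTransformer model defined above;
    # embeddings is unused here because nn_model was already fitted on it.
    query_embedding = model.encode([query])
    # Indices of the chunks closest to the query by cosine distance.
    distances, indices = nn_model.kneighbors(query_embedding)
    relevant_texts = [texts[i] for i in indices[0]]
    context = " ".join(relevant_texts)
    # Extractive QA: the answer is a span copied out of the context.
    result = qa_pipeline(question=query, context=context, max_answer_len=max_answer_len)
    return result['answer']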
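
run_query returns only result['answer'], but the dictionary returned by the question-answering pipeline also carries a confidence value under 'score'. A minimal sketch of a guard that falls back to a default message for low-confidence answers might look like this. The helper answer_or_fallback, the 0.2 threshold, and the sample dictionaries are all hypothetical and only mimic the shape of the pipeline's result.

```python
def answer_or_fallback(result, min_score=0.2, fallback="No confident answer found."):
    # result mimics the dict returned by the question-answering pipeline,
    # with keys 'score', 'start', 'end', 'answer'.
    # The threshold is an arbitrary assumption and should be tuned on real queries.
    if result["score"] < min_score or not result["answer"].strip():
        return fallback
    return result["answer"]

# Made-up results for illustration; they are not outputs of the notebook's model.
confident = {"score": 0.91, "start": 10, "end": 24, "answer": "example answer"}
uncertain = {"score": 0.03, "start": 0, "end": 5, "answer": "guess"}

print(answer_or_fallback(confident))  # example answer
print(answer_or_fallback(uncertain))  # No confident answer found.
```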

In [None]:
queries = [
    "what is the GPA in post graduation",
    "what are the certifications",
]

In [None]:
for query in queries:
    answer = run_query(query, texts, embeddings, nn_model, qa_pipeline)
    print(f"Query: {query}\nAnswer: {answer}\n")

Query: what is the GPA in post graduation
Answer: 6.80 /10

Query: what are the certifications
Answer: Object-Oriented Data Structures in C++

