<a href="https://colab.research.google.com/github/karan2261/PDF_Chatbot_Project/blob/main/PDF_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is this Project about?
- This project is an interactive chatbot for document-based question answering, powered by a pretrained language model. Users upload a PDF document and ask questions about its content in natural language; the answers come from the text extracted from that document.
- The Workflow section at the end outlines how it works.

# Setting up the Environment

In [1]:
!pip install pytesseract
!sudo apt install tesseract-ocr
!pip install pdf2image
!pip install PyPDF2

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 

In [2]:
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
import PyPDF2
from transformers import pipeline
from google.colab import files
import io

# Function to Extract Text from PDFs


In [3]:
# Extract text from a PDF, with optional OCR for scanned (image-only) PDFs
def extract_text_from_pdf(pdf_path, use_ocr=False):
    text = ""
    if use_ocr:
        # Render each PDF page to an image (pdf2image relies on poppler-utils)
        images = convert_from_path(pdf_path)
        for image in images:
            # Run Tesseract OCR on the page image
            text += pytesseract.image_to_string(image)
    else:
        # Read the embedded text layer of a text-based PDF
        with open(pdf_path, "rb") as file:
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                # Guard against pages that yield no text
                text += page.extract_text() or ""

    return text

# Function to Answer Questions

In [4]:
# Answer a question with an extractive QA model
def answer_question(context, question, qa_model):
    # The question-answering pipeline takes the question and the context
    # directly and returns the span of the context it scores highest.
    # It does not accept a free-form prompt, so none is built here.
    return qa_model(question=question, context=context)["answer"]

# An Interactive Chatbot


In [5]:
# Interactive Chatbot with PDF Upload and Question Answering
def interactive_chatbot():
    # Upload PDF file using Google Colab's file upload widget
    uploaded = files.upload()

    if uploaded:
        # Get the uploaded PDF file name
        pdf_file = next(iter(uploaded))
        print(f"PDF file '{pdf_file}' uploaded successfully!")

        # Load the Hugging Face QA model
        qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

        # Extract text from the uploaded PDF
        text = extract_text_from_pdf(pdf_file, use_ocr=True)  # Use OCR for scanned PDFs

        print("Chatbot is ready! Ask me anything from the document!")

        while True:
            user_question = input("You: ")

            if user_question.lower() in ["exit", "quit", "bye"]:
                print("Chatbot: Goodbye! Exiting...")
                break

            # Get the answer from the QA model
            answer = answer_question(text, user_question, qa_model)
            print(f"Chatbot: {answer}")
    else:
        print("No PDF file uploaded!")
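
The run log later in this notebook warns that a GPU is available but no `device` argument was passed, so the model runs on CPU. A minimal sketch of one way to address that follows; `pick_device` is a hypothetical helper, not part of the original notebook, and it assumes the PyTorch backend.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: the PyTorch backend is installed, as on Colab
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


# Hypothetical usage inside interactive_chatbot():
# qa_model = pipeline(
#     "question-answering",
#     model="deepset/roberta-base-squad2",
#     device=pick_device(),
# )
```

The integer convention (`-1` for CPU, `0` for the first GPU) is the one the `transformers` `pipeline` function accepts for `device`.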

In [6]:
!apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.5 [186 kB]
Fetched 186 kB in 1s (210 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 123680 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.5_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.5) ...
Setting up poppler-utils (22.02.0-2ubuntu0.5) ...
Processing triggers for man-db (2.10.2-1) ...


# Running the Chatbot

In [7]:
# Run the chatbot
if __name__ == "__main__":
    interactive_chatbot()

Saving NLP doc file.pdf to NLP doc file.pdf
PDF file 'NLP doc file.pdf' uploaded successfully!


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Chatbot is ready! Ask me anything from the document!
You: What is tokenization?




Chatbot: breaking down a large
body of text into smaller units like words or sentences
You: What does sentiment analysis do?
Chatbot: determines the emotional tone behind a piece of text
You: What is the difference between rule-based and machine learning approaches in NLP?
Chatbot: bridging the
gap
You: "question": What are transformers in NLP?, "context": Transformers are deep learning architectures
Chatbot: Recurrent Neural Networks

You: what are transformers in NLP?
Chatbot: Speech recognition
systems
You: what are transformers in NLP?
Chatbot: Speech recognition
systems
You: what are transformers in NLP?
Chatbot: Speech recognition
systems
You: what is speech recognition?
Chatbot: transform spoken language into written text
You: how is transformer related to that?
Chatbot: The introduction of transformers has significantly improved NLP capabilities
You: continue
Chatbot: computational linguistics
You: how to ends this?
Chatbot: language
modeling
You: what are speech recognition sy

# Workflow

## 1. Text Extraction from PDFs:
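- The application can handle both text-based and scanned PDFs. `extract_text_from_pdf` uses PyPDF2 to read the text layer of standard PDFs and, when `use_ocr=True`, converts the pages to images with pdf2image and runs Tesseract OCR on them. As written, `interactive_chatbot` always passes `use_ocr=True`, so every uploaded file goes through OCR.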
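
One way to choose between the two extraction paths automatically is to try direct extraction first and fall back to OCR only when it returns almost nothing. The sketch below is an illustration, not the notebook's method: `extract_with_fallback` and the `min_chars` threshold of 50 are assumptions, and the two extractors are passed in as callables so the logic can be tried without a PDF.

```python
from typing import Callable


def extract_with_fallback(
    pdf_path: str,
    extract_direct: Callable[[str], str],
    extract_ocr: Callable[[str], str],
    min_chars: int = 50,  # assumed threshold; tune for your documents
) -> str:
    """Try direct extraction first; use OCR when almost no text comes back."""
    text = extract_direct(pdf_path) or ""
    # Scanned PDFs usually yield little or no text from the direct path.
    if len(text.strip()) >= min_chars:
        return text
    return extract_ocr(pdf_path)


# Hypothetical usage with the notebook's own function:
# text = extract_with_fallback(
#     pdf_file,
#     lambda p: extract_text_from_pdf(p, use_ocr=False),
#     lambda p: extract_text_from_pdf(p, use_ocr=True),
# )
```

OCR is the slower path, so skipping it for PDFs that already carry a text layer should shorten start-up for most uploads.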

## 2. Question Answering with LLM:
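- A pretrained Hugging Face model (deepset/roberta-base-squad2) answers questions based on the extracted text. It is an extractive question-answering model: it receives the question and the document text and returns the span of that text it scores as the most likely answer. No prompt template reaches the model, so answer quality depends on the extracted text and on the model rather than on prompt engineering.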
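
In the sample session above, some answers contain line breaks carried over from the PDF text, and some are clearly off topic. The question-answering pipeline returns a dictionary with `answer` and `score` keys, so a small post-processing step can tidy the text and decline to answer when the score is low. This is a sketch under assumptions: `format_answer`, the fallback message, and the `min_score` value of 0.1 are all hypothetical choices, not part of the notebook.

```python
def format_answer(result: dict, min_score: float = 0.1) -> str:
    """Tidy a QA result dict that has "answer" and "score" keys."""
    # Collapse newlines and repeated spaces inherited from the PDF text.
    answer = " ".join(result.get("answer", "").split())
    # Decline to answer when the span is empty or the model's score is low.
    if not answer or result.get("score", 0.0) < min_score:
        return "I could not find a confident answer in the document."
    return answer


# Hypothetical usage inside answer_question():
# return format_answer(qa_model(question=question, context=context))
```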

## 3. Interactive Chatbot:
The core of the project is the interactive chatbot interface. Users can:

* Upload a PDF document.
* Ask questions about its content in natural language.
* Receive precise answers based on the document's context.
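
The question-and-answer loop behind these steps can also be written so that it is independent of `input()` and `print()`, which makes it easier to test. The sketch below is one possible arrangement; `run_chat` and `EXIT_WORDS` are hypothetical names, and `answer_fn` stands in for a call to the notebook's `answer_question`.

```python
from typing import Callable, Iterable, Iterator

EXIT_WORDS = {"exit", "quit", "bye"}


def run_chat(questions: Iterable[str], answer_fn: Callable[[str], str]) -> Iterator[str]:
    """Yield one reply per question until an exit word is seen."""
    for raw in questions:
        question = raw.strip()
        if question.lower() in EXIT_WORDS:
            break
        if not question:
            continue  # skip blank input instead of querying the model
        yield answer_fn(question)


# Hypothetical interactive usage in the notebook:
# prompts = iter(lambda: input("You: "), None)
# for reply in run_chat(prompts, lambda q: answer_question(text, q, qa_model)):
#     print(f"Chatbot: {reply}")
```

Because `run_chat` is a generator, each reply is produced before the next question is read, so the interactive behavior matches the original loop.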





# Conclusion:
- This project demonstrates how a pretrained language model can power an interactive chatbot that answers questions from PDF documents. By combining text extraction, OCR, and question answering, it lets users query a document without technical expertise. The sample session shows both accurate answers (for example, on tokenization and sentiment analysis) and misses (the questions about transformers), which indicates where future improvements are needed. The project lays a foundation for those improvements and for more complex document-understanding applications.






