<a href="https://colab.research.google.com/github/keithth/AI_Apps/blob/main/27a_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/logging)

## Installation

In [10]:
!pip uninstall -y haystack-ai pdfminer.six pytesseract numpy

Found existing installation: haystack-ai 2.9.0
Uninstalling haystack-ai-2.9.0:
  Successfully uninstalled haystack-ai-2.9.0
Found existing installation: pdfminer.six 20231228
Uninstalling pdfminer.six-20231228:
  Successfully uninstalled pdfminer.six-20231228
Found existing installation: numpy 2.2.2
Uninstalling numpy-2.2.2:
  Successfully uninstalled numpy-2.2.2


In [1]:
# %% [bash]
# Install the latest Haystack 2.x along with required dependencies.
!pip install -qU haystack-ai pydantic
!pip install numpy==1.24.4

!pip install -qU pdfminer.six==20231228




In [2]:
!pip show haystack-ai pdfminer.six pytesseract | grep Version | cut -d: -f2

 2.9.0
 20231228
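
The `grep | cut` output above drops the package names, so a package that is not installed (here `pytesseract`, which was requested but has no version line) silently disappears from the list. A minimal sketch of the same check in Python, using the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so missing ones stay visible.
for name, ver in installed_versions(["haystack-ai", "pdfminer.six", "pytesseract"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Keeping the name next to the version makes it easier to spot a dependency that a later cell expects but an earlier cell removed.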


## Key & logs

In [3]:
from google.colab import userdata
openai_api_key = userdata.get('OPENAI_API_KEY')

In [4]:

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# test

In [5]:

# %% [python]
# Import required modules using the updated paths
from haystack.document_stores.in_memory import InMemoryDocumentStore

In [6]:

from haystack.components.converters.pdfminer import PDFMinerToDocument

# converter = PDFMinerToDocument()
# results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
# documents = results["documents"]
# print(documents[0].content)
# # 'This is a text from the PDF file.'

In [7]:
from haystack import Document
from haystack.components.readers import ExtractiveReader

In [8]:

from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

In [10]:

from haystack.components.retrievers import InMemoryEmbeddingRetriever

In [12]:

from haystack import Pipeline, Document

# -----------------------------------------------------------------------------
# Step 1: Initialize the DocumentStore
# -----------------------------------------------------------------------------
document_store = InMemoryDocumentStore()


In [20]:

# -----------------------------------------------------------------------------
# Step 2: Convert a PDF into Documents
# -----------------------------------------------------------------------------
# Initialize the PDF converter. PDFMinerToDocument accepts pdfminer layout options such as
# line_overlap, char_margin, and line_margin (see the API reference for PDF converters).
pdf_converter = PDFMinerToDocument()


# b1

In [21]:

# Additional Haystack 2.x components used below. The embedders need the
# sentence-transformers package, and the reader needs transformers with torch.
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.writers import DocumentWriter

# Update the file path to your PDF file.
pdf_file_path = "/content/drive/MyDrive/A-A-ML/haystack/SEPMSpecialPublication2012Sylvester.pdf"

# -----------------------------------------------------------------------------
# Step 3: Preprocess the Documents
# -----------------------------------------------------------------------------
# The old monolithic PreProcessor has been replaced with dedicated components:
# DocumentCleaner and DocumentSplitter.
cleaner = DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True)
splitter = DocumentSplitter(split_by="sentence", split_length=100, split_overlap=10)

# -----------------------------------------------------------------------------
# Step 4: Embed and Index the Documents, then Initialize the Retriever
# -----------------------------------------------------------------------------
# Haystack 2.x has no update_embeddings(): documents are embedded before they are
# written to the store. use_gpu is gone as well; by default the components select
# a GPU when one is available.
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
indexing = Pipeline()
indexing.add_component("converter", pdf_converter)
indexing.add_component("cleaner", cleaner)
indexing.add_component("splitter", splitter)
indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model=embedding_model))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("converter.documents", "cleaner.documents")
indexing.connect("cleaner.documents", "splitter.documents")
indexing.connect("splitter.documents", "doc_embedder.documents")
indexing.connect("doc_embedder.documents", "writer.documents")
# Components expose run(), not convert(); the converter takes a list of sources.
indexing.run({"converter": {"sources": [pdf_file_path]}})

# For an in-memory dense retriever, use InMemoryEmbeddingRetriever. It ranks stored
# embeddings against a query embedding produced by a separate text embedder.
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# -----------------------------------------------------------------------------
# Step 5: Initialize the Reader
# -----------------------------------------------------------------------------
# ExtractiveReader replaces the Haystack 1.x FARMReader.
reader = ExtractiveReader(model="deepset/roberta-base-squad2")

# -----------------------------------------------------------------------------
# Step 6: Build the Extractive QA Pipeline
# -----------------------------------------------------------------------------
# Create a new pipeline (Haystack 2.x style uses add_component and connect).
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=embedding_model))
pipeline.add_component("retriever", retriever)
pipeline.add_component("reader", reader)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "reader.documents")

# -----------------------------------------------------------------------------
# Step 7: Run a Sample Query
# -----------------------------------------------------------------------------
query = "What is seismic stratigraphy?"
result = pipeline.run(data={
    "text_embedder": {"text": query},
    "retriever": {"top_k": 5},
    "reader": {"query": query, "top_k": 3}
})

# Print the results.
print(result)
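
The `split_length=100, split_overlap=10` settings in Step 3 describe a sliding window over sentences: each chunk holds up to 100 sentences and repeats the last 10 of the previous chunk. A rough sketch of that windowing in plain Python, which is not Haystack's implementation; `window_chunks` and the naive period-based sentence split are assumptions made for illustration:

```python
def window_chunks(units, length, overlap):
    """Group units into windows of `length`, each sharing `overlap` units with the previous one."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + length])
        if start + length >= len(units):
            break  # this window already reaches the end of the input
    return chunks


# Toy input: seven one-letter "sentences", split naively on periods.
sentences = [s.strip() + "." for s in "A. B. C. D. E. F. G".split(".") if s.strip()]
for chunk in window_chunks(sentences, length=3, overlap=1):
    print(chunk)
```

A larger overlap keeps more context across chunk boundaries at the cost of storing and embedding more repeated text.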

## import

In [None]:
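# pdfplumber and pytesseract are needed for the illustration extraction below;
# pytesseract also needs the Tesseract binary.
!pip install -q pdfminer.six pdfplumber pytesseract
!apt-get -qq install -y tesseract-ocr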

In [None]:
# Import libraries using the Haystack 2.x import paths per the migration guide
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PDFMinerToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
# EmbeddingRetriever and FARMReader are Haystack 1.x nodes; their 2.x counterparts
# are InMemoryEmbeddingRetriever and ExtractiveReader.
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.readers import ExtractiveReader
from haystack import Pipeline



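In [None]:
# PDFToTextConverter is a Haystack 1.x node and is not available in haystack-ai 2.x,
# so neither of these imports works here; PDFMinerToDocument takes its place.
# from haystack.components.file_converter.pdf import PDFToTextConverter
# from haystack.nodes.file_converter.pdf import PDFToTextConverter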


In [None]:
# Additional imports for image extraction
import pdfplumber
import os
from PIL import Image
import pytesseract


# def extract_illustrations

In [None]:

# --- Define a Function to Extract Illustrations and Captions from the PDF ---
def extract_illustrations(pdf_path, output_dir="extracted_images", resolution=150):
    os.makedirs(output_dir, exist_ok=True)
    illustrations = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # page.images is an empty list when the page contains no images
            for img_num, img in enumerate(page.images, start=1):
                # Bounding box in page coordinates (x0, top, x1, bottom)
                bbox = (img['x0'], img['top'], img['x1'], img['bottom'])
                try:
                    # Crop the page to the illustration region, then render only that region
                    cropped_image = page.crop(bbox).to_image(resolution=resolution).original
                except ValueError:
                    # pdfplumber rejects boxes that are not fully inside the page
                    continue
                # Save the cropped image to file (numbered per page)
                image_filename = os.path.join(output_dir, f"page_{page_num}_img_{img_num}.png")
                cropped_image.save(image_filename)
                # OCR only sees text inside the image itself; a caption typeset below
                # the figure as page text is not captured here.
                caption = pytesseract.image_to_string(cropped_image).strip()
                illustrations.append({
                    "page": page_num,
                    "image_path": image_filename,
                    "caption": caption
                })
    return illustrations
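
Image boxes reported by a PDF can extend past the page edge, and by default pdfplumber's `page.crop` rejects a box that is not fully inside the page. One alternative to skipping such images is to clip the box first. A small sketch, assuming boxes are `(x0, top, x1, bottom)` tuples as in the function above; `clamp_bbox` is a hypothetical helper, and with pdfplumber the page box would come from `page.bbox`:

```python
def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) to the page box; return None if nothing is left."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clipped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    if clipped[0] >= clipped[2] or clipped[1] >= clipped[3]:
        return None  # zero or negative area: the box lies outside the page
    return clipped


page_box = (0, 0, 612, 792)  # US Letter in PDF points, used here only as sample data
print(clamp_bbox((-5, 700, 300, 800), page_box))  # partly off the page
print(clamp_bbox((700, 0, 800, 50), page_box))    # entirely off the page
```

Clipping keeps partially visible figures, whereas skipping avoids saving images that are mostly outside the visible page.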


# code1

In [None]:

from haystack import Document
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.writers import DocumentWriter

# --- Initialize the DocumentStore ---
document_store = InMemoryDocumentStore()

# --- Convert the PDF to Text ---
pdf_path = "/content/drive/MyDrive/A-A-ML/haystack/SEPMSpecialPublication2012Sylvester.pdf"
pdf_converter = PDFMinerToDocument()
docs = pdf_converter.run(sources=[pdf_path])["documents"]

# --- Extract Illustrations and Their Captions from the PDF ---
illustrations = extract_illustrations(pdf_path)

# --- Create "Illustration" Documents from the Extracted Data ---
illustration_docs = []
for ill in illustrations:
    # Use the OCR caption if available; otherwise, use a default message.
    content = ill["caption"] if ill["caption"] else f"Illustration from page {ill['page']} (no caption detected)"
    illustration_docs.append(Document(
        content=content,
        meta={"type": "illustration", "image_path": ill["image_path"], "page": ill["page"]},
    ))

# --- Preprocess, Embed, and Write All Documents to the DocumentStore ---
# DocumentCleaner and DocumentSplitter replace the 1.x PreProcessor. Embeddings are
# computed before writing because Haystack 2.x has no update_embeddings().
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True))
indexing.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=100))
indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model=embedding_model))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("cleaner.documents", "splitter.documents")
indexing.connect("splitter.documents", "doc_embedder.documents")
indexing.connect("doc_embedder.documents", "writer.documents")
# Text and illustration documents go through the same indexing steps.
indexing.run({"cleaner": {"documents": docs + illustration_docs}})

# --- Initialize the Retriever ---
retriever = InMemoryEmbeddingRetriever(document_store=document_store)

# --- Initialize the Reader ---
reader = ExtractiveReader(model="deepset/roberta-base-squad2")

# --- Build the Extractive QA Pipeline ---
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=embedding_model))
pipeline.add_component("retriever", retriever)
pipeline.add_component("reader", reader)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "reader.documents")
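
Embedding retrieval comes down to ranking stored vectors by their similarity to the query vector and keeping the best `top_k`. A toy sketch with hand-written 3-dimensional vectors and cosine similarity as one common similarity choice; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the `top_k_by_cosine` helper and the sample vectors are invented for illustration:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, doc_vecs, top_k):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for document embeddings.
docs = {
    "seismic": [0.9, 0.1, 0.0],
    "captions": [0.1, 0.8, 0.2],
    "overview": [0.5, 0.5, 0.5],
}
print(top_k_by_cosine([1.0, 0.0, 0.0], docs, top_k=2))
```

The reader then only sees the documents that survive this ranking, so a retriever `top_k` that is too small can hide the passage that contains the answer.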


# def summarize_content

In [None]:

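# --- Function to Query and Summarize Content ---
def summarize_content(query):
    # Assumes `pipeline` is a Haystack 2.x Pipeline whose components are named
    # "text_embedder", "retriever", and "reader".
    result = pipeline.run(data={
        "text_embedder": {"text": query},
        "retriever": {"top_k": 15},
        "reader": {"query": query, "top_k": 5}
    })
    summaries = []
    for answer in result["reader"]["answers"]:
        summaries.append({
            "Answer": answer.data,
            "Score": answer.score,
            # The reader's "no answer" entry has no source document.
            "Context": answer.document.content if answer.document else None
        })
    return summaries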
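
Extractive readers usually return low-confidence spans alongside the good ones, so it can help to filter the list before printing it. A sketch over plain dictionaries with the same keys that `summarize_content` returns; the `keep_confident` helper, the 0.5 threshold, and the sample records are arbitrary assumptions:

```python
def keep_confident(summaries, min_score=0.5):
    """Drop empty answers and answers below min_score; sort the rest by score."""
    kept = [s for s in summaries if s["Answer"] and s["Score"] >= min_score]
    return sorted(kept, key=lambda s: s["Score"], reverse=True)


# Made-up records for illustration only.
sample = [
    {"Answer": "span one", "Score": 0.62, "Context": "context for span one"},
    {"Answer": None, "Score": 0.90, "Context": None},  # a "no answer" entry
    {"Answer": "span two", "Score": 0.31, "Context": "context for span two"},
    {"Answer": "span three", "Score": 0.88, "Context": "context for span three"},
]
for item in keep_confident(sample):
    print(item["Answer"], item["Score"])
```

The threshold is worth tuning per document, since reader scores are not calibrated the same way across models.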


# Geological Queries

In [None]:

# --- Example Geological Queries ---
queries = [
    "Summarize the text related to seismic stratigraphy.",
    "Describe the illustrations and their captions.",
    "Provide an overview of the key points in the paper."
]

# --- Collect and Display Summaries ---
all_summaries = {}
for query in queries:
    all_summaries[query] = summarize_content(query)

for key, summaries in all_summaries.items():
    print(f"\n{key}:")
    for summary in summaries:
        print(f"Answer: {summary['Answer']}, Score: {summary['Score']}, Context: {summary['Context']}")


# b2