<a href="https://colab.research.google.com/github/keithth/AI_Apps/blob/main/27b_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installation

In [1]:
!pip uninstall -y haystack-ai pdfminer.six pytesseract numpy

Found existing installation: haystack-ai 2.9.0
Uninstalling haystack-ai-2.9.0:
  Successfully uninstalled haystack-ai-2.9.0
Found existing installation: pdfminer.six 20231228
Uninstalling pdfminer.six-20231228:
  Successfully uninstalled pdfminer.six-20231228
Found existing installation: numpy 1.24.4
Uninstalling numpy-1.24.4:
  Successfully uninstalled numpy-1.24.4


In [2]:
# Install the latest Haystack 2.x along with its required dependencies.
!pip install -qU haystack-ai pydantic
# Pin NumPy below 2.0 afterwards, because the upgrade above can pull in NumPy 2.x.
!pip install numpy==1.24.4

!pip install -qU pdfminer.six==20231228


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.3.17 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.2.2 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.2 which is incompatible.
thinc 8.2.5 requires numpy<2.0.0,>=1.19.0; python_version >= "3.9", but you have numpy 2.2.2 which is incompatible.
pytensor 2.26.4 requires numpy<2,>=1.17.0, but you have numpy 2.2.2 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.2.2 which is incompatible.
numba 0.61.0 requires numpy<2.2,>=1.24, but you have numpy 2.2.2 which is incompatible.
Collecting numpy==1.24.4
  Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_

In [3]:
!pip show haystack-ai pdfminer.six pytesseract | grep Version | cut -d: -f2

ERROR: Operation cancelled by user

## Key & logs

In [1]:
from google.colab import userdata
openai_api_key = userdata.get('OPENAI_API_KEY')
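
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. As a minimal sketch, assuming the same secret name, the lookup could fall back to an environment variable when the notebook runs outside Colab. `load_api_key` is a hypothetical helper introduced here for illustration.

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the secret `name` from Colab secrets, or from the environment elsewhere."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the key is exported as an environment variable.
        key = os.environ.get(name)
    else:
        key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key
```

With this helper, `openai_api_key = load_api_key()` would replace the direct `userdata.get` call.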

In [2]:

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Code 1: Indexing Pipeline (for a PDF)

In [3]:
!pip install pypdf


Collecting pypdf
  Downloading pypdf-5.2.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.2.0-py3-none-any.whl (298 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.2.0


In [4]:
# Requires the haystack-ai package (Haystack 2.x). Uninstall any farm-haystack (1.x)
# installation first, since both provide the `haystack` module.
# !pip install haystack-ai

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PyPDFToDocument  # new PDF converter component
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter

# Initialize an in-memory document store (in Haystack 2.x it is imported from haystack.document_stores.in_memory)
document_store = InMemoryDocumentStore()

# Create the components:
pdf_converter = PyPDFToDocument()  # converts PDF files to Documents
cleaner = DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True)
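# Note: split_by="passage" counts blank-line-separated paragraphs, so split_length=100 means
# up to 100 paragraphs per chunk. The cleaner above removes empty lines first, so few
# paragraph breaks may remain; split_by="word" is an alternative if smaller chunks are wanted.
splitter = DocumentSplitter(split_by="passage", split_length=100, split_overlap=20)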
writer = DocumentWriter(document_store)

# Build the indexing pipeline:
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", pdf_converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

# Connect the components explicitly:
indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "writer.documents")


INFO:haystack.telemetry._telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).


<haystack.core.pipeline.pipeline.Pipeline object at 0x7d12a3fa5a10>
🚅 Components
  - converter: PyPDFToDocument
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - writer: DocumentWriter
🛤️ Connections
  - converter.documents -> cleaner.documents (List[Document])
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> writer.documents (List[Document])
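
The `split_length` and `split_overlap` arguments passed to `DocumentSplitter` above describe a sliding window: each chunk holds up to `split_length` units, and consecutive chunks share `split_overlap` units. The sketch below illustrates that idea on whitespace-separated words in plain Python. `split_words` is a hypothetical helper written for illustration; it is not Haystack's implementation and ignores details such as page tracking.

```python
def split_words(text: str, split_length: int, split_overlap: int) -> list[str]:
    """Split text into chunks of up to `split_length` words, overlapping by `split_overlap`."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in split_words(sample, split_length=4, split_overlap=1):
        print(chunk)
```

With ten words, a length of 4, and an overlap of 1, the window advances three words at a time, so the last word of each chunk reappears as the first word of the next.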

In [5]:

# Run the pipeline on your PDF file(s)
# (Assume file_paths is a list containing the path to your PDF file)
file_paths = ["/content/drive/MyDrive/A-A-ML/haystack/SEPMSpecialPublication2012Sylvester.pdf"]  # replace with your actual file path(s)
index_result = indexing_pipeline.run({"converter": {"sources": file_paths}})
print("Indexing completed.")


INFO:haystack.core.pipeline.base:Warming up component splitter...
INFO:haystack.core.pipeline.pipeline:Running component converter
INFO:haystack.core.pipeline.pipeline:Running component cleaner
INFO:haystack.core.pipeline.pipeline:Running component splitter
INFO:haystack.core.pipeline.pipeline:Running component writer


Indexing completed.


## Code 2: Query Pipeline (Extractive QA)

In [6]:
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.readers import ExtractiveReader

# Create a retriever – note that in 2.0 the retriever is now a component attached to a specific DocumentStore.
retriever = InMemoryBM25Retriever(document_store=document_store)

# Create a reader (e.g. using deepset's RoBERTa model fine‑tuned on SQuAD2)
reader = ExtractiveReader(model="deepset/roberta-base-squad2")

# Build the query (QA) pipeline:
qa_pipeline = Pipeline()
qa_pipeline.add_component("retriever", retriever)
qa_pipeline.add_component("reader", reader)

# Connect the retriever's documents output to the reader's documents input
qa_pipeline.connect("retriever.documents", "reader.documents")

# Now run a query:
query = "What is the main topic of the PDF?"
qa_result = qa_pipeline.run({
    "retriever": {"query": query, "top_k": 10},
    "reader": {"query": query, "top_k": 5}
})


INFO:haystack.core.pipeline.base:Warming up component reader...
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/496M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.core.pipeline.pipeline:Running component retriever
INFO:haystack.core.pipeline.pipeline:Running component reader
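
The retriever step ranks stored chunks by BM25, a keyword-based relevance score, before the reader looks for answer spans. To make that ranking concrete, here is a minimal sketch of the basic Okapi BM25 formula in plain Python. It is an illustration only: `bm25_scores`, the whitespace tokenization, and the default `k1` and `b` values are assumptions made here, and Haystack's in-memory store may use a different BM25 variant and tokenizer.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    query_terms = set(query.lower().split())
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "channel evolution in submarine fans",
        "delta lobe deposits",
        "river channel migration and channel cutoff",
    ]
    for doc, score in zip(docs, bm25_scores("channel evolution", docs)):
        print(f"{score:.3f}  {doc}")
```

Because BM25 only matches query words literally, a broad question such as "What is the main topic of the PDF?" gives the retriever little to rank on, which is worth keeping in mind when reading the answers in the next cells.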


In [7]:
# Print the answers:
for answer in qa_result["reader"]["answers"]:
    print("Answer:", answer.data)  # use the 'data' attribute for the answer text
    print("Score:", answer.score)
    print()


Answer: 
lobe 2 lobe
Score: 0.539194643497467

Answer: 
3
4
5
6
7
8
9
10
11
12
13
14

Score: 0.536962628364563

Answer: channel evolution

Score: 0.5279945135116577

Answer: 
2

Score: 0.5272735357284546

Answer: Fuji–Einstein delta
Score: 0.5234811902046204

Answer: None
Score: 0.02268666060195595



In [8]:
# Assuming qa_result is defined and contains a key "reader" with "answers"
scores = []
for answer in qa_result["reader"]["answers"]:
    # Use the 'data' attribute (as per the ExtractedAnswer data class) for the answer text.
    answer_text = answer.data
    score = answer.score
    scores.append(score)
    print("Answer:", answer_text)
    print("Score:", score)
    print()  # blank line for readability

# Print the range of scores if any answers were returned.
if scores:
    print("Score Range: min = {:.4f}, max = {:.4f}".format(min(scores), max(scores)))
else:
    print("No answers found.")


Answer: 
lobe 2 lobe
Score: 0.539194643497467

Answer: 
3
4
5
6
7
8
9
10
11
12
13
14

Score: 0.536962628364563

Answer: channel evolution

Score: 0.5279945135116577

Answer: 
2

Score: 0.5272735357284546

Answer: Fuji–Einstein delta
Score: 0.5234811902046204

Answer: None
Score: 0.02268666060195595

Score Range: min = 0.0227, max = 0.5392
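
Several of the extracted spans above are whitespace or bare digits, and the last entry is the reader's no-answer candidate (`None`). A small post-processing step can drop those before display. The sketch below uses a stand-in `Answer` dataclass with only `data` and `score` fields so it runs without Haystack; the `min_score` threshold and the `is_informative` heuristic are illustrative assumptions, not part of the library.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    """Stand-in for an extracted answer: only the two fields used below."""
    data: Optional[str]
    score: float


def is_informative(text: Optional[str]) -> bool:
    """Reject None, empty text, and text with no alphabetic characters."""
    if text is None:
        return False
    return re.search(r"[^\W\d_]", text) is not None


def filter_answers(answers: list[Answer], min_score: float = 0.5) -> list[Answer]:
    """Keep informative answers scoring at least `min_score`, best first."""
    kept = [a for a in answers if is_informative(a.data) and a.score >= min_score]
    return sorted(kept, key=lambda a: a.score, reverse=True)


if __name__ == "__main__":
    # Values taken (rounded) from the output above.
    answers = [
        Answer("\nlobe 2 lobe", 0.5392),
        Answer("\n3\n4\n5\n", 0.5370),
        Answer("channel evolution\n", 0.5280),
        Answer(None, 0.0227),
    ]
    for a in filter_answers(answers):
        print(f"{a.score:.4f}  {a.data.strip()}")
```

The same two functions should work on the real `qa_result["reader"]["answers"]` list, since they only read the `data` and `score` attributes.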


In [9]:
for answer in qa_result["reader"]["answers"]:
    print("Available attributes:", dir(answer))  # every attribute, including dunder methods
    print("Answer:", answer)  # full ExtractedAnswer repr; use answer.data for the text alone
    print("Score:", answer.score)
    print("Context:", answer.context)  # the surrounding context
    print("Document:", answer.document)  # the source Document the span was extracted from
    print()

Available attributes: ['Span', '__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'context', 'context_offset', 'data', 'document', 'document_offset', 'from_dict', 'meta', 'query', 'score', 'to_dict']
Answer: ExtractedAnswer(query='What is the main topic of the PDF?', score=0.539194643497467, data='\nlobe 2 lobe', document=Document(id=eb531b25786747ca91335c30e8a8f823db7cd87d6e58eafafd9fd70df6e98305, content: 'See	discussions,	stats,	and	author	profiles	for	this	publication	at: https://www.researchgate.net/pu...', meta: {'file_path': 'SEPMSpecialPublication2012Sylvester.pdf', 'source_id': '30bd539455a
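
`dir()` lists every dunder method alongside the data fields, which makes the output hard to scan. Since the attribute listing above includes `__dataclass_fields__`, the object is a dataclass, and `dataclasses.fields` from the standard library can list just its declared fields. The sketch uses a hypothetical stand-in class `DemoAnswer` so it runs without Haystack; `describe` is likewise an illustrative helper.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class DemoAnswer:
    """Hypothetical stand-in with a few of the fields listed above."""
    query: str
    score: float
    data: Optional[str] = None
    meta: dict[str, Any] = field(default_factory=dict)


def describe(obj: Any, max_len: int = 60) -> dict[str, str]:
    """Map each dataclass field name to a shortened repr of its value."""
    summary = {}
    for f in fields(obj):
        text = repr(getattr(obj, f.name))
        summary[f.name] = text if len(text) <= max_len else text[: max_len - 3] + "..."
    return summary


if __name__ == "__main__":
    answer = DemoAnswer(query="What is the main topic of the PDF?", score=0.539, data="\nlobe 2 lobe")
    for name, value in describe(answer).items():
        print(f"{name}: {value}")
```

Truncating each value keeps long fields, such as an embedded document, from flooding the cell output.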

In [10]:

# Print the answers.
# ExtractedAnswer stores the answer text in `data`; it has no `answer` attribute,
# so `answer.answer` raises AttributeError.
for answer in qa_result["reader"]["answers"]:
    print("Answer:", answer.data)
    print("Score:", answer.score)