# Information Retrieval from PDF Documents

**Using PyPDF2 to read and extract text from a PDF**

In [2]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


# Classes and Their Responsibilities
**PDFProcessor Class**

Handles PDF file validation and text extraction. Encapsulates the logic for interacting with the PDF file.

**Attributes:**



```

pdf_path (str): # Path to the PDF file.
text (str):  # Extracted text from the PDF.
```



**Methods:**



```
validate_pdf(): # Checks that the file exists and has a .pdf extension.
extract_text(): # Reads the PDF and extracts text from its pages.
```



**KeywordSearcher Class**

Handles keyword searching within the extracted text. Encapsulates the logic for splitting text into sentences and finding matches.

**Attributes:**



```
text (str): # Extracted text to search within.
keywords (list): # List of keywords to search for.
```


**Methods:**

`search(): # Splits text into sentences and finds sentences containing the keywords.`

**PDFKeywordSearchApp Class**

Orchestrates the overall workflow of the program and provides the user interface for file input and keyword searching.

**Methods:**

`run(): # Main method that drives the program execution.`

# Step 1: Import Libraries

In [None]:
import PyPDF2
import re
import os

# Step 2: Define the PDFProcessor Class


In [None]:
class PDFProcessor:
    def __init__(self, pdf_path):
        """Initialize with the path to the PDF file."""
        self.pdf_path = pdf_path
        self.text = ""  # Placeholder for extracted text

    def validate_pdf(self):
        """Check if the file exists and is a valid PDF."""
        if not os.path.isfile(self.pdf_path):
            raise FileNotFoundError("File not found. Please ensure the path is correct.")
        if not self.pdf_path.lower().endswith('.pdf'):
            raise ValueError("The selected file is not a PDF.")

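    def extract_text(self):
        """Extract text from every page of the PDF file."""
        with open(self.pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            # Rebuild the text from scratch so repeated calls do not duplicate it;
            # pages with no extractable text contribute an empty string.
            self.text = "".join((page.extract_text() or "") + "\n" for page in reader.pages)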
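
The validation step in `PDFProcessor` looks only at the file path, so a renamed text file would pass it. A minimal sketch of an additional content check is shown here, assuming that a well-formed PDF begins with the `%PDF-` header marker. The helper name `looks_like_pdf` is hypothetical and is not part of the class above.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker '%PDF-'.

    Hypothetical helper: it only inspects the first five bytes and does not
    prove that the rest of the file can be parsed.
    """
    with open(path, 'rb') as file:
        return file.read(5) == b'%PDF-'
```

A check like this could be called from `validate_pdf()` after the extension test, raising `ValueError` when it returns `False`.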


# Step 3: Define the KeywordSearcher Class

In [None]:
class KeywordSearcher:
    def __init__(self, text, keywords):
        self.text = text
        self.keywords = keywords

    def search(self):
        """Search for keywords in text and return matching sentences."""
        results = {}
        # Simplified sentence splitting
        sentences = re.split(r'(?<=[.!?])\s+', self.text)
        for keyword in self.keywords:
            results[keyword] = [sentence for sentence in sentences if keyword.lower() in sentence.lower()]
        return results
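
To see what the simplified sentence splitting does in practice, the following standalone sketch applies the same regular expression and case-insensitive substring test to a small sample string. The function name `find_sentences` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def find_sentences(text, keywords):
    """Map each keyword to the sentences that contain it (case-insensitive)."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return {
        keyword: [s for s in sentences if keyword.lower() in s.lower()]
        for keyword in keywords
    }

sample = "PDF files store text. Extraction can fail! Is the pdf scanned? See e.g. the manual."
matches = find_sentences(sample, ["pdf", "manual"])
print(matches["pdf"])     # ['PDF files store text.', 'Is the pdf scanned?']
print(matches["manual"])  # ['the manual.']
```

The second result shows the main limitation of this approach: the abbreviation "e.g." ends in a period followed by a space, so the pattern treats it as a sentence boundary and cuts the sentence short.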

# Step 4: Define the PDFKeywordSearchApp Class

In [None]:
class PDFKeywordSearchApp:
    def run(self):
        try:
            pdf_path = input("Enter the full path to your PDF file: ")
            processor = PDFProcessor(pdf_path)
            processor.validate_pdf()
            processor.extract_text()

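            raw_keywords = input("Enter keywords to search (comma-separated): ")
            # Strip surrounding whitespace and drop empty entries so that
            # "alpha, beta" searches for "beta" rather than " beta".
            keywords = [k.strip() for k in raw_keywords.split(',') if k.strip()]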
            searcher = KeywordSearcher(processor.text, keywords)
            results = searcher.search()

            print("\nSearch Results:")
            for keyword, sentences in results.items():
                print(f"\nKeyword: {keyword.strip()}\n")
                for sentence in sentences:
                    print(f"- {sentence.strip()}")
        except (FileNotFoundError, ValueError) as e:
            print(f"Error: {e}")

# Step 5: Add the Entry Point

In [9]:
if __name__ == "__main__":
    app = PDFKeywordSearchApp()
    app.run()

Enter the full path to your PDF file: /content/CSCM13 2022_8572818.PDF
Enter keywords to search (comma-separated): (0 <= X

Search Results:

Keyword: (0 <= X

- (2) (Z + Y = X + Aux ^0<= Aux ^Aux<= Y^ :(Aux = 0) ^Aux1 = Aux - 2 ^Z1 = Z - 1)
!(Z1 + Y = X + Aux1 ^0<= Aux1 ^Aux1 <= Y)
(3) (0 <= X^X<= 1000 ^0<= Y^Y<= 1000 ^Aux = Y ^Z = X )
!(Z + Y = X + Aux ^0<= Aux ^Aux<= Y)
(d) Which of the generated verication conditions are universally true?
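
The single match above spans several lines because the extracted text keeps the line breaks of the PDF page, and the splitter only breaks on sentence-ending punctuation. One possible preprocessing step, sketched here as an assumption rather than as part of the program above, is to collapse all whitespace before splitting. The helper name `normalize_whitespace` and the sample string are hypothetical.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r'\s+', ' ', text).strip()

raw = "The condition holds\nfor all inputs.  It fails\n\nfor none."
clean = normalize_whitespace(raw)
sentences = re.split(r'(?<=[.!?])\s+', clean)
print(sentences)  # ['The condition holds for all inputs.', 'It fails for none.']
```

This makes each result print on one line, at the cost of discarding layout that can be meaningful in documents such as the exam paper searched here, where formulas are set on separate lines.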
