<a href="https://colab.research.google.com/github/michaelwnau/consequential-products/blob/main/mindflayer_v1_0_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Mindflayer: Analyzing Scientific Articles with OCR and Multimodal Models**

###This Colab Notebook demonstrates how to:

- Ingest and extract text from a PDF scientific article.
- Extract images, figures, and tables from the PDF.
- Apply OCR to images to extract embedded text.
- Use multimodal models to generate captions for images.
- Display and interpret the extracted content.




##**Problem Statement**


> "OBJECTIVE:
>The Defense Threat Reduction Agency (DTRA) seeks to develop AI/ML models and pipelines capable of identifying, extracting, and processing elements from scanned technical and scientific documents. This project aims to automate the extraction of tables, plots, photos, and other elements embedded within structured and unstructured text, ensuring high fidelity and accuracy in a production environment."

Source: "DTRA243-003 AI/ML Data Extraction from Scientific Documents." See the [full problem statement](https://www.dodsbirsttr.mil/submissions/api/public/download/solicitationDocuments?solicitation=DOD_SBIR_2024_P1_C3&documentType=INSTRUCTIONS&component=DTRA).




##**Table of Contents**
1. Setup
2. Ingest the article
3. Extract Images from the PDF
4. Apply OCR to Extract Text from Images
5. Interpret Images Using A Multimodal Model
6. Extract and Interpret Tables
7. Display Results
8. Advanced Interpretation (Optional)

##**Installing Tesseract OCR**

---



Tesseract OCR is required for text extraction from images.

Download and install it in your local environment using documentation found here: [Tesseract repo](https://github.com/tesseract-ocr/tesseract).

Install Tesseract OCR

`!sudo apt-get install tesseract-ocr`

Set the Tesseract command path (Colab usually has it installed in /usr/bin/tesseract)

`pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'`
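
If the binary is not at the path above, the following sketch looks it up on `PATH` first and falls back to the usual Colab location. The helper name `find_tesseract` is hypothetical and not part of pytesseract; it relies only on the standard-library `shutil.which`.

```python
import os
import shutil


def find_tesseract(default="/usr/bin/tesseract"):
    """Return a path to the tesseract binary, or None if none is found.

    Hypothetical helper: checks PATH first, then the default location.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    return default if os.path.isfile(default) else None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it before running the OCR cells.")
else:
    print(f"Using Tesseract at {tesseract_path}")
    # Then: pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

Checking up front gives a clear message before any OCR cell runs, rather than an error partway through the image loop.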




##**Mount Google Drive**
For this notebook, we use Google Drive as the document storage repository. Run this cell and follow the prompts to connect your Drive. If you already have a document collection prepared, select it; otherwise use the sample document attached here for demonstration.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


##**1. Setup**
First, let's install the necessary libraries by running the cell.

In [None]:
# Install libraries
!pip install PyPDF2
!pip install PyMuPDF
!pip install pytesseract
!pip install Pillow
!pip install transformers
!pip install torch

In [None]:
# Import the libraries
import PyPDF2
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import torch
import os
from IPython.display import Image as IPythonImage, display
import pandas as pd
import io

##**2. Ingest the article**
With Google Drive mounted in the setup steps, set the file path to the article you want to ingest. This demonstration targets a single article, but with slight modification it can ingest files in batches.

In [None]:
# Replace this path with the path to your PDF file in Google Drive
pdf_file_path = '/content/drive/MyDrive/technical-manuals-training-data/AEROSPACE-EQUIPMENT-MAINTENANCE-INSPECTION-to-00-20-1.pdf'
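
For batch ingestion, one possible starting point is to collect every PDF in a folder and loop over the result. This is a sketch only: `list_pdfs` is a hypothetical helper, and the folder in the usage comment is simply the parent directory of the sample path above.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return every PDF directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Hypothetical usage: loop over a Drive folder instead of a single file.
# for pdf_path in list_pdfs('/content/drive/MyDrive/technical-manuals-training-data'):
#     pdf_file_path = str(pdf_path)
#     ...run the extraction cells for this file...
```

The suffix check is case-insensitive, so files named `.PDF` are picked up as well. Subfolders are not searched.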

###Extract Text from the PDF

In [None]:
# Open the PDF file from Google Drive and extract text from each page.
# The with-block closes the file even if extraction raises an error.
text = ''
with open(pdf_file_path, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf_reader.pages:
        page_text = page.extract_text()
        if page_text:
            # Separate pages with a newline so words do not run together
            text += page_text + '\n'

# Display a snippet of the extracted text
print("Extracted Text Snippet:\n")
print(text[:500])
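
Scanned pages often carry no text layer, so `extract_text()` can return an empty or near-empty string, and the loop above skips those pages without notice. The sketch below flags such pages, assuming you keep one string per page. The helper `pages_needing_ocr` and the `min_chars` threshold are illustrative choices, not part of PyPDF2.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text looks empty.

    `page_texts` holds one string (or None) per page, in page order.
    """
    flagged = []
    for number, page_text in enumerate(page_texts, start=1):
        if len((page_text or "").strip()) < min_chars:
            flagged.append(number)
    return flagged


# Example with made-up page contents: page 2 is blank, page 3 is None.
sample_pages = ["Chapter 1. General maintenance policy and scope.", "  \n", None]
print(pages_needing_ocr(sample_pages))  # [2, 3]
```

To build the input, collect each `page.extract_text()` result in a list inside the extraction loop. Flagged pages are candidates for the OCR step later in the notebook.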


##**3. Extract Images from the PDF**



In [None]:
# Open the PDF file with PyMuPDF from Google Drive
doc = fitz.open(pdf_file_path)

# Create a directory to store images
os.makedirs('extracted_images', exist_ok=True)

# List to store image file paths
image_paths = []

# Iterate through each page to extract images
for page_index in range(len(doc)):
    page = doc[page_index]
    image_list = page.get_images(full=True)

    print(f"Found {len(image_list)} images on page {page_index+1}")

    for img_index, img in enumerate(image_list, start=1):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image["image"]
        image_ext = base_image["ext"]
        image_name = f"image_page{page_index+1}_{img_index}.{image_ext}"
        image_path = os.path.join('extracted_images', image_name)

        # Save the image and record its path
        with open(image_path, "wb") as image_file:
            image_file.write(image_bytes)
        image_paths.append(image_path)

# Release the file handle once every page has been processed
doc.close()
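
A PDF can reference the same image (a logo, for example) on many pages, in which case the loop above writes it once per page. One way to skip repeats is to compare content hashes, sketched here on raw bytes. The helper `unique_blobs` is hypothetical. In the loop above, the same check would be applied to `image_bytes` before saving.

```python
import hashlib


def unique_blobs(blobs):
    """Yield (index, blob) for the first occurrence of each distinct blob."""
    seen = set()
    for index, blob in enumerate(blobs):
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield index, blob


# Made-up byte strings standing in for extracted image data.
sample = [b"logo", b"figure-1", b"logo"]
print([index for index, _ in unique_blobs(sample)])  # [0, 1]
```

Skipping duplicates also avoids running OCR and captioning on the same image repeatedly in the later steps.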


##**4. Apply OCR to Extract Text from Images**

In [None]:
# Dictionary to store OCR results
ocr_results = {}

for image_path in image_paths:
    # Open the image file and run OCR on it
    with Image.open(image_path) as img:
        # Use a separate name so the PDF text stored in `text` is not overwritten
        ocr_text = pytesseract.image_to_string(img)

    # Store the result
    ocr_results[image_path] = ocr_text


##**5. Interpret Images Using A Multimodal Model**

We are using BLIP for image interpretation and captioning. BLIP is an open-source, pre-trained model from Salesforce ([model card](https://huggingface.co/Salesforce/blip-image-captioning-base)).

In [None]:
# Load the processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

###Generate Captions for Images
BLIP's default captions are short. Passing a text prompt alongside the image, or raising the generation length limit, can yield longer or more targeted captions, which is useful for producing alt text for accessibility.

In [None]:
# Dictionary to store captions
captions = {}

for image_path in image_paths:
    raw_image = Image.open(image_path).convert('RGB')

    # Prepare the image for the model
    inputs = processor(raw_image, return_tensors="pt").to(torch.device('cpu'))

    # Generate caption
    out = model.generate(**inputs)
    caption = processor.decode(out[0], skip_special_tokens=True)

    # Store the caption
    captions[image_path] = caption
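
The loop above can be wrapped in a small helper that accepts an optional text prompt and a length limit. This is a sketch: `caption_image` is a hypothetical name, the prompt text and `max_new_tokens=50` are arbitrary choices, and it assumes the `processor` and `model` objects loaded earlier. Passing text alongside the image follows the conditional-captioning usage shown on the BLIP model card.

```python
def caption_image(image, processor, model, prompt=None, max_new_tokens=50):
    """Caption one PIL image, optionally steering BLIP with a text prefix."""
    if prompt:
        # Conditional captioning: the caption continues the given prefix.
        inputs = processor(image, prompt, return_tensors="pt")
    else:
        inputs = processor(image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)


# Hypothetical usage with the objects loaded above:
# raw_image = Image.open(image_paths[0]).convert('RGB')
# print(caption_image(raw_image, processor, model, prompt="a technical diagram of"))
```

Because the processor and model are passed in as arguments, the same helper can be reused if a different captioning checkpoint is loaded.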


##**6. Extract and Interpret Tables**
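When a table is embedded as an image, we can attempt to parse it from the OCR text. Parsing is best-effort: OCR output is rarely clean CSV, so expect failures on complex layouts.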

In [None]:
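for image_path, ocr_text in ocr_results.items():
    # File names from step 3 never contain 'table', so also check the OCR text
    # (assumption: table images include a "Table ..." label)
    if 'table' in image_path.lower() or 'table' in ocr_text.lower():
        print(f"OCR Text for {image_path}:\n{ocr_text}\n")
        # Attempt to parse the table
        try:
            df = pd.read_csv(io.StringIO(ocr_text))
            print(f"DataFrame for {image_path}:\n{df}\n")
        except Exception as exc:
            # A bare except would also trap KeyboardInterrupt
            print(f"Could not parse table from {image_path}: {exc}\n")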
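
`pd.read_csv` expects comma-separated input by default, whereas OCR output from a table image usually has columns separated by runs of spaces. The following is a minimal sketch, assuming columns are separated by two or more whitespace characters. `split_ocr_rows` is a hypothetical helper and the sample text is made up.

```python
import re


def split_ocr_rows(ocr_text):
    """Split OCR text into rows of cells on runs of 2+ whitespace characters."""
    rows = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


sample = "Part No    Interval    Action\nA-100      30 days     Inspect\n"
for row in split_ocr_rows(sample):
    print(row)
```

When every row has the same number of cells, the result can be turned into a DataFrame with `pd.DataFrame(rows[1:], columns=rows[0])`. Single spaces inside a cell, such as `30 days`, are preserved.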


##**7. Display Results**
###Display Images with Captions and OCR Text

In [None]:
for image_path in image_paths:
    print(f"Image: {image_path}")
    display(IPythonImage(filename=image_path))
    print(f"Caption: {captions.get(image_path, 'No caption available')}\n")
    print(f"OCR Text:\n{ocr_results.get(image_path, 'No OCR text available')}\n")
    print("-" * 50)


##**8. Advanced Interpretation (Optional)**
###Summarize the Extracted Text

In [None]:
from transformers import pipeline

# Load summarization pipeline
summarizer = pipeline("summarization")

# Due to token limits, you may need to summarize in chunks.
chunk_size = 1000  # Adjust based on your needs
text_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

summaries = []
for chunk in text_chunks:
    summary = summarizer(chunk, max_length=150, min_length=40, do_sample=False)
    summaries.append(summary[0]['summary_text'])

# Combine summaries
full_summary = ' '.join(summaries)
print("Summary of the Article:\n")
print(full_summary)
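
Slicing by character count can cut a word or sentence in half at each chunk boundary. Below is a sketch of a word-boundary alternative. `chunk_words` is a hypothetical helper, and `max_chars` plays the role of `chunk_size` above.

```python
def chunk_words(text, max_chars=1000):
    """Split text into chunks of at most max_chars, breaking only at whitespace.

    Words are rejoined with single spaces. A single word longer than
    max_chars becomes its own oversized chunk.
    """
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty.
        extra = len(word) + (1 if current else 0)
        if current and length + extra > max_chars:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_words("inspect the aircraft before each flight", max_chars=15))
# ['inspect the', 'aircraft before', 'each flight']
```

The limit here is still measured in characters, not model tokens, so it only approximates the summarizer's input limit.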


##**Notes and Considerations**
- Tesseract OCR: pytesseract is only a wrapper and needs the Tesseract binary. If OCR calls fail because the binary cannot be found, install it as described under Installing Tesseract OCR and check that `tesseract_cmd` points to it.

- GPU Support: For large models or faster computation, enable GPU acceleration in Colab: Runtime > Change runtime type > Hardware accelerator > GPU. The captioning cell in step 5 keeps its inputs on the CPU, so the model and inputs must both be moved to the GPU device to benefit.

- Model Performance: The pre-trained models may not perfectly interpret complex scientific images or tables. For better results, consider fine-tuning models or using specialized models.

- Error Handling: Handle exceptions, especially around OCR and table parsing, so that a single unreadable image does not halt the whole run.

- Data Privacy: Be cautious when uploading proprietary or sensitive documents to Colab, as they are processed on external servers.
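
The error-handling note above can be made concrete with a small wrapper that records failures instead of stopping. This is a sketch: `run_safely` is a hypothetical helper, and `fn` would be something like a function that opens an image and calls `pytesseract.image_to_string` on it.

```python
def run_safely(items, fn):
    """Apply fn to each item; return (results, errors) keyed by item."""
    results, errors = {}, {}
    for item in items:
        try:
            results[item] = fn(item)
        except Exception as exc:  # keep going, but remember what failed
            errors[item] = repr(exc)
    return results, errors


# Made-up example: int() fails on the second entry but the loop continues.
results, errors = run_safely(["12", "x7", "30"], int)
print(results)  # {'12': 12, '30': 30}
print(sorted(errors))  # ['x7']
```

Keeping the errors in a dictionary makes it possible to review or retry only the failed items after the run.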