# Method 5: Qwen2-VL Vision Direct Extraction

**Direct Image Processing for Scientific Posters**

This notebook demonstrates how to use **Qwen2-VL-2B-Instruct** to extract metadata from scientific posters by analyzing the page image directly, with no separate text-extraction step.

## ✨ Key Advantages:
- **Direct Image Processing**: Analyzes the poster visually, the way a human reader would
- **Same Prompt Style**: Uses the same direct prompt style as the DeepSeek/Mistral methods
- **No Text Extraction**: Pure vision-based processing
- **Reusable & Generalizable**: Not tied to a particular poster layout (demonstrated here on one test poster)

## 🎯 Results Preview:
- ✅ **5/5 Authors** extracted with affiliations
- ✅ **Complete JSON** structure in a single pass (funding sources and references came back empty in this run)
- ✅ **~42 seconds** processing time on an NVIDIA GeForce RTX 4090
- ✅ **Direct approach** - processes images without OCR


In [1]:
#!/usr/bin/env python3
import os
import json
import fitz  # PyMuPDF
import torch
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional, Any
import time
from PIL import Image
import io
import re
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

print("📦 All imports successful!")
print(f"🔥 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🎯 GPU: {torch.cuda.get_device_name()}")


📦 All imports successful!
🔥 CUDA available: True
🎯 GPU: NVIDIA GeForce RTX 4090


## 🎯 The Direct Vision Prompt

The functions below use the same simple, direct prompt style as the DeepSeek/Mistral methods, applied to a poster image instead of extracted text.


In [2]:
def convert_pdf_to_images(pdf_path: str, dpi: int = 200) -> List[Image.Image]:
    """Convert PDF pages to high-quality PIL Images for vision models"""
    doc = fitz.open(pdf_path)
    images = []
    
    print(f"📄 Converting PDF to images (DPI: {dpi})...")
    
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        
        # Convert to image with specified DPI
        mat = fitz.Matrix(dpi/72, dpi/72)
        pix = page.get_pixmap(matrix=mat)
        
        # Convert to PIL Image
        img_data = pix.tobytes("png")
        img = Image.open(io.BytesIO(img_data))
        images.append(img)
        
        print(f"   Page {page_num + 1}: {img.size[0]}x{img.size[1]} pixels")
    
    doc.close()
    return images

def create_vision_prompt() -> str:
    """Create the same direct prompt style as DeepSeek/Mistral for vision models"""
    return """Analyze this scientific poster image and extract metadata. Return ONLY valid JSON with no explanations or formatting:

{
  "title": "exact poster title",
  "authors": [
    {"name": "Full Name", "affiliations": ["Institution"], "email": null}
  ],
  "summary": "brief research summary",
  "keywords": ["key", "terms"],
  "methods": "methodology description",
  "results": "main findings and results",
  "references": [
    {"title": "paper title", "authors": "authors", "year": 2024, "journal": "journal"}
  ],
  "funding_sources": ["funding info"],
  "conference_info": {"location": "location", "date": "date"}
}"""

def load_qwen2vl_model():
    """Load Qwen2-VL model and processor"""
    print("🤖 Loading Qwen2-VL model...")
    
    model_name = "Qwen/Qwen2-VL-2B-Instruct"
    
    try:
        model = Qwen2VLForConditionalGeneration.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        
        processor = AutoProcessor.from_pretrained(model_name)
        
        print(f"✅ Qwen2-VL loaded successfully")
        return model, processor
        
    except Exception as e:
        print(f"❌ Failed to load Qwen2-VL: {e}")
        return None, None

print("✅ Core vision functions defined")
print("🎯 Same prompt style as DeepSeek/Mistral - but for images!")


✅ Core vision functions defined
🎯 Same prompt style as DeepSeek/Mistral - but for images!
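
`convert_pdf_to_images` scales each page by `dpi/72` because PDF dimensions are expressed in points (1/72 inch). At 200 DPI the test poster in this notebook renders to 5512x7874 pixels, which is a very large input for a 2B vision model. The sketch below shows one way to compute a smaller target size under a pixel budget while keeping the aspect ratio. The function name `fit_within_pixel_budget`, the budget value, and the default `multiple=28` are assumptions made for illustration; they are not part of the original notebook.

```python
import math

def fit_within_pixel_budget(width, height, max_pixels, multiple=28):
    """Return (new_width, new_height) that keeps the aspect ratio roughly
    intact and snaps each side down to a multiple (minimum one multiple).

    `multiple=28` is an assumption based on Qwen2-VL using 14-pixel
    patches merged 2x2; adjust it for other models.
    """
    if width * height <= max_pixels:
        return width, height  # already within budget, leave untouched
    # Uniform scale factor that brings the area down to the budget
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    return new_w, new_h

# Hypothetical budget: 1280 blocks of 28x28 pixels
budget = 1280 * 28 * 28
target = fit_within_pixel_budget(5512, 7874, budget)
print(target)
```

The resulting tuple could be passed to PIL's `image.resize(target)` before the image is handed to the model. The Qwen2-VL processor also performs its own resizing, so whether an explicit resize helps speed or accuracy here is something to measure rather than assume.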


In [3]:
def extract_with_qwen2vl(model, processor, image: Image.Image, prompt: str) -> str:
    """Extract metadata using Qwen2-VL vision model"""
    print("🔄 Generating response with Qwen2-VL...")
    
    try:
        # Prepare conversation format
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": prompt}
                ]
            }
        ]
        
        # Apply chat template
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        
        # Process inputs
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt"
        )
        
        # Move tensors to the device the model was loaded on (device_map="auto")
        inputs = inputs.to(model.device)
        
        # Generate response
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                max_new_tokens=3000,  # Increased for complete outputs
                do_sample=False,
                pad_token_id=processor.tokenizer.eos_token_id
            )
        
        # Trim input tokens and decode
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        
        response = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
        
        return response.strip()
        
    except Exception as e:
        print(f"❌ Generation failed: {e}")
        return ""

def clean_vision_response(response: str) -> Dict[str, Any]:
    """Clean and parse vision response to valid JSON"""
    print(f"📝 Raw response length: {len(response)} chars")
    
    # Remove markdown formatting
    response = response.replace('```json', '').replace('```', '').strip()
    
    # Find JSON object boundaries
    start_idx = response.find('{')
    end_idx = response.rfind('}')
    
    if start_idx != -1 and end_idx != -1 and end_idx > start_idx:
        json_str = response[start_idx:end_idx + 1]
        
        # Basic cleanup
        json_str = re.sub(r',\s*}', '}', json_str)  # Remove trailing commas
        json_str = re.sub(r',\s*]', ']', json_str)  # Remove trailing commas in arrays
        
        try:
            # Parse and clean the structure
            data = json.loads(json_str)
            
            # Ensure all required fields exist with defaults
            cleaned_data = {
                "title": data.get("title", "Unknown Title"),
                "authors": data.get("authors", []),
                "summary": data.get("summary", "No summary available"),
                "keywords": data.get("keywords", []),
                "methods": data.get("methods", "No methods described"),
                "results": str(data.get("results", "No results available")),  # Ensure string
                "references": data.get("references", []),  # Keep all references
                "funding_sources": data.get("funding_sources", []),
                "conference_info": data.get("conference_info", {})
            }
            
            return cleaned_data
            
        except json.JSONDecodeError as e:
            print(f"⚠️ JSON parsing failed: {e}")
            # Return minimal structure if parsing fails
            return {
                "title": "Extraction Failed",
                "authors": [],
                "summary": "Could not parse response",
                "keywords": [],
                "methods": "Parsing error",
                "results": "Parsing error", 
                "references": [],
                "funding_sources": [],
                "conference_info": {}
            }
    
    # If no JSON found, return empty structure
    return {
        "title": "No JSON Found",
        "authors": [],
        "summary": "No structured data extracted",
        "keywords": [],
        "methods": "No data",
        "results": "No data",
        "references": [],
        "funding_sources": [],
        "conference_info": {}
    }

print("✅ Vision extraction functions defined")
print("🎯 Ready to process any scientific poster image!")


✅ Vision extraction functions defined
🎯 Ready to process any scientific poster image!
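
To make the cleanup steps in `clean_vision_response` concrete, here is a minimal standalone sketch of the same idea: strip markdown fences, take the span from the first `{` to the last `}`, drop trailing commas, and parse. The helper name `extract_json_object` and the sample response are hypothetical and exist only for this illustration.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid nesting fences here

def extract_json_object(text):
    """Parse the first '{' ... last '}' span in text, or return None."""
    text = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    # Remove a comma that directly precedes a closing brace or bracket
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

# Hand-written sample imitating a chatty model reply with trailing commas
sample = (
    "Here is the metadata:\n"
    + FENCE + "json\n"
    + '{"title": "Demo Poster", "keywords": ["a", "b",], "authors": [],}\n'
    + FENCE
)
print(extract_json_object(sample))
```

One caveat of this regex approach: it would also rewrite a string value that happens to contain `,}` or `,]`, so it is a pragmatic repair step and not a full JSON fixer.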


## 🚀 Run Vision-Based Extraction


In [4]:
if torch.cuda.is_available():
    print("🚀 Running Method 5: Qwen2-VL Vision Direct Extraction")
    print("=" * 65)
    
    # Load model
    model, processor = load_qwen2vl_model()
    
    if model is not None:
        # Convert PDF to image
        pdf_path = "../data/test-poster.pdf"
        images = convert_pdf_to_images(pdf_path, dpi=200)
        
        if images:
            # Use first page
            image = images[0]
            print(f"📸 Processing image: {image.size[0]}x{image.size[1]} pixels")
            
            # Create prompt
            prompt = create_vision_prompt()
            
            # Extract metadata
            start_time = time.time()
            response = extract_with_qwen2vl(model, processor, image, prompt)
            end_time = time.time()
            
            if response:
                # Clean and parse response
                results = clean_vision_response(response)
                
                # Display results
                print("\n📊 EXTRACTION RESULTS:")
                print("=" * 50)
                print(f"📄 TITLE: {results.get('title', 'N/A')}")
                print(f"👥 AUTHORS: {len(results.get('authors', []))} found")
                for i, author in enumerate(results.get('authors', []), 1):
                    print(f"   {i}. {author.get('name', 'N/A')} - {author.get('affiliations', ['N/A'])}")
                
                print(f"💰 FUNDING: {results.get('funding_sources', ['None found'])}")
                print(f"📚 REFERENCES: {len(results.get('references', []))} found")
                print(f"⏱️ Processing time: {end_time - start_time:.1f} seconds")
                
                # Save results
                output_path = "../output/method5_qwen2vl_vision_results.json"
                os.makedirs("../output", exist_ok=True)
                
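                # Write UTF-8 and keep accented characters (e.g. "Politècnica") readable
                with open(output_path, 'w', encoding='utf-8') as f:
                    json.dump(results, f, indent=2, ensure_ascii=False)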
                
                print(f"💾 Results saved to: {output_path}")
                print("✅ Method 5 completed successfully!")
                print("🎯 Same prompt as DeepSeek/Mistral, working vision results!")
                
            else:
                print("❌ No response generated")
        else:
            print("❌ No images extracted from PDF")
    else:
        print("❌ Failed to load model")
        
else:
    print("❌ CUDA not available - this notebook expects a CUDA-capable GPU")


🚀 Running Method 5: Qwen2-VL Vision Direct Extraction
🤖 Loading Qwen2-VL model...


✅ Qwen2-VL loaded successfully
📄 Converting PDF to images (DPI: 200)...


   Page 1: 5512x7874 pixels
📸 Processing image: 5512x7874 pixels
🔄 Generating response with Qwen2-VL...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


📝 Raw response length: 2845 chars

📊 EXTRACTION RESULTS:
📄 TITLE: INFLUENCE OF DRUG-POLYMER INTERACTIONS ON RELEASE KINETICS OF PLGA AND PLA/PEG NPS
👥 AUTHORS: 5 found
   1. Merve Gul - ['Department of Drug Sciences, University of Pavia']
   2. Ida Genta - ['Department of Chemical Engineering, Universitat Politècnica de Catalunya (UPC-EEBE)']
   3. Maria M. Perez Madrigal - ['Barcelona Research Center for Multiscale Science and Engineering, EEBE, Universitat Politècnica de Catalunya']
   4. Carlos Aleman - ['Barcelona Research Center for Multiscale Science and Engineering, EEBE, Universitat Politècnica de Catalunya']
   5. Enrica Chiesa - ['Barcelona Research Center for Multiscale Science and Engineering, EEBE, Universitat Politècnica de Catalunya']
💰 FUNDING: []
📚 REFERENCES: 0 found
⏱️ Processing time: 42.2 seconds
💾 Results saved to: ../output/method5_qwen2vl_vision_results.json
✅ Method 5 completed successfully!
🎯 Same prompt as DeepSeek/Mistral, working vision results!
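
In the run above the JSON parsed cleanly, yet `funding_sources` and `references` came back empty. A quick completeness check makes such gaps visible before comparing methods. The sketch below is one possible approach; the helper `empty_fields` and the example dictionary are hypothetical, and the dictionary is hand-written, not the actual model output.

```python
REQUIRED_FIELDS = (
    "title", "authors", "summary", "keywords", "methods",
    "results", "references", "funding_sources", "conference_info",
)

def empty_fields(results):
    """Return the required field names that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not results.get(name)]

# Hand-written example with the same shape as the notebook's output schema
example = {
    "title": "Example title",
    "authors": [
        {"name": "A. Author", "affiliations": ["Example University"], "email": None}
    ],
    "summary": "Short summary",
    "keywords": ["example"],
    "methods": "Example methods",
    "results": "Example results",
    "references": [],
    "funding_sources": [],
    "conference_info": {},
}
print(empty_fields(example))
```

Note that `clean_vision_response` fills missing string fields with placeholders such as "No summary available", which this check would treat as non-empty; comparing against those placeholder strings would be a natural extension.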
