# Method 2a: Llama 3 8B Local Extraction

## Overview
This method extracts poster metadata with an 8B-parameter Llama Instruct model loaded locally. Inference runs entirely on your own hardware, with no external API calls.

## Accuracy Note
The accuracy estimate is unvalidated and rests on limited testing only. Before production use, measure the actual accuracy on a manually checked validation sample whose size is set with Cochran's formula.
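
As a rough guide to the size of that validation effort, the sketch below computes a Cochran sample size, that is, how many posters would need manual checking. The 95% confidence level, the 5% margin of error, the conservative proportion `p = 0.5`, and the function name `cochran_sample_size` are illustrative assumptions, not values taken from this project.

```python
import math
from typing import Optional


def cochran_sample_size(
    z: float = 1.96,       # z-score for ~95% confidence (assumed)
    p: float = 0.5,        # expected proportion; 0.5 is the most conservative choice
    margin: float = 0.05,  # tolerated margin of error (assumed)
    population: Optional[int] = None,
) -> int:
    """Return the number of items to validate manually, rounded up."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        # Finite population correction for small collections
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


print(cochran_sample_size())                 # very large collection
print(cochran_sample_size(population=1000))  # collection of 1,000 posters
```

Under these assumptions the sample comes to 385 posters for a very large collection and 278 for a collection of 1,000.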

## Performance Characteristics
- **Estimated Accuracy**: 85-90% (unvalidated - requires Cochran sampling validation)
- **Cost**: $0 in API fees (runs locally; electricity is the only running cost)
- **Speed**: 15-45 s per poster (single), ~2-3 s per poster (batched, RTX 4090)
- **Hallucination Risk**: Very Low (structured prompting + greedy decoding)
- **Setup**: Medium-Complex - requires model download and 8GB+ VRAM

## Hardware Requirements
- **GPU**: 8GB+ VRAM (RTX 4090 recommended)
- **RAM**: 16GB+ system memory
- **Storage**: ~16GB for model files
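
The storage and VRAM figures follow from simple arithmetic on the parameter count. The sketch below assumes roughly 8 billion parameters and counts weights only; activations, the KV cache, and quantization overhead come on top, so a card with exactly 8GB is likely to be tight for 8-bit weights. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # approximate parameter count for an 8B model (assumption)

for label, bits in [("float16", 16), ("int8", 8)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB for weights")
```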

## Best For
- Privacy-sensitive environments requiring higher accuracy than smaller models
- Organizations with GPU resources but API budget constraints
- Research applications requiring reproducible, deterministic results
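
The reproducibility claim rests on greedy decoding: at every step the model takes the single most probable token instead of drawing one at random. The toy sketch below illustrates the difference with made-up logits for three tokens. It does not call the model, and bitwise-identical output on a GPU can still depend on library versions and kernels.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng):
    """Sampling: draw an index in proportion to its probability."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


toy_logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

greedy_runs = {greedy_pick(toy_logits) for _ in range(100)}
sampled_runs = {sample_pick(toy_logits, random.Random(seed)) for seed in range(100)}

print("greedy choices:", greedy_runs)    # one distinct choice
print("sampled choices:", sampled_runs)  # more than one distinct choice
```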

In [1]:
# Imports and setup
import os
import warnings
import contextlib
import io
import logging
# Suppress TensorFlow and CUDA initialization warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
warnings.filterwarnings("ignore")
#from jtools import normalize_characters
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import fitz  # PyMuPDF
from pathlib import Path
import time
from datetime import datetime
from typing import Dict, List, Any, Optional

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️  Using device: {device}")
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"💾 GPU memory: {total_gb:.1f}GB")
else:
    print("Using CPU")
print("✅ Environment ready for Method 2a: Llama 3.2 8b Local")

🖥️  Using device: cuda
💾 GPU memory: 25.3GB
✅ Environment ready for Method 2a: Llama 3.2 8b Local


In [2]:

import unicodedata
import re

def remove_quotes(text):
    """Remove surrounding quotes from text"""
    text = text.strip()
    if (text.startswith("'") and text.endswith("'")) or (text.startswith('"') and text.endswith('"')):
        return text[1:-1]
    return text

def clean_llama_response(response: str, field_type: str) -> str:
    """Clean up verbose Llama responses to extract just the content"""
    response = response.strip()
    
    # Remove common verbose prefixes
    prefixes_to_remove = [
        "The title of the poster is:",
        "Here are the author names extracted in a comma-separated list:",
        "Here is a 2-sentence summary of the poster:",
        "Here are 5-6 keywords extracted from the poster:",
        "Here are the methods mentioned in the poster:",
        "Here are the main results extracted from the poster:",
        "Here are the references found in the poster:",
        "Here are the funding sources found:",
        "Here is the conference information:",
        "The title is:",
        "Authors:",
        "Summary:",
        "Keywords:",
        "Methods:",
        "Results:",
        "References:",
        "Funding:",
        "Conference:"
    ]
    
    for prefix in prefixes_to_remove:
        if response.lower().startswith(prefix.lower()):
            response = response[len(prefix):].strip()
    
    # Remove numbered lists (1., 2., etc.)
    if field_type in ['keywords', 'methods', 'funding_sources']:
        lines = response.split('\n')
        cleaned_lines = []
        for line in lines:
            # Remove numbering like "1.", "2.", "*", "-" at start of line
            line = re.sub(r'^\s*[\d]+\.\s*', '', line)
            line = re.sub(r'^\s*[\*\-]\s*', '', line)
            if line.strip():
                cleaned_lines.append(line.strip())
        response = '\n'.join(cleaned_lines) if field_type == 'methods' else ', '.join(cleaned_lines)
    
    # Remove quotes and extra whitespace
    response = remove_quotes(response)
    
    return response.strip()
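

def normalize_characters(text):
    """Map typographic variants (spaces, quotes, brackets, dashes, digits) to plain equivalents"""
    # Normalize Greek characters (NFC leaves these precomposed letters unchanged, so this loop is a no-op)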
    greek_chars = ['α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π', 'ρ', 'ς', 'σ', 'τ', 'υ', 'φ', 'χ', 'ψ', 'ω', 'Α', 'Β', 'Γ', 'Δ', 'Ε', 'Ζ', 'Η', 'Θ', 'Ι', 'Κ', 'Λ', 'Μ', 'Ν', 'Ξ', 'Ο', 'Π', 'Ρ', 'Σ', 'Τ', 'Υ', 'Φ', 'Χ', 'Ψ', 'Ω']
    for char in greek_chars:
        text = text.replace(char, unicodedata.normalize('NFC', char))

    # Normalize space characters
    space_chars = ['\xa0', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u202f', '\u205f', '\u3000']
    for space in space_chars:
        text = text.replace(space, ' ')

    # Normalize single quotes ('‟' is a double quote and is handled below)
    single_quotes = ['‘', '’', '‛', '′', '‹', '›', '‚']
    for quote in single_quotes:
        text = text.replace(quote, "'")

    # Normalize double quotes
    double_quotes = ['“', '”', '„', '‟', '«', '»', '〝', '〞', '〟', '＂']
    for quote in double_quotes:
        text = text.replace(quote, '"')

    # Normalize brackets
    brackets = {
        '【': '[', '】': ']',
        '（': '(', '）': ')',
        '｛': '{', '｝': '}',
        '〚': '[', '〛': ']',
        '〈': '<', '〉': '>',
        '《': '<', '》': '>',
        '「': '[', '」': ']',
        '『': '[', '』': ']',
        '〔': '[', '〕': ']',
        '〖': '[', '〗': ']'
    }
    for old, new in brackets.items():
        text = text.replace(old, new)

    # Normalize hyphens and dashes
    hyphens_and_dashes = ['‐', '‑', '‒', '–', '—', '―']
    for dash in hyphens_and_dashes:
        text = text.replace(dash, '-')

    # Normalize line breaks
    line_breaks = ['\r\n', '\r']
    for line_break in line_breaks:
        text = text.replace(line_break, '\n')

    # Normalize superscripts and subscripts to normal numbers
    superscripts = '⁰¹²³⁴⁵⁶⁷⁸⁹'
    subscripts = '₀₁₂₃₄₅₆₇₈₉'
    normal_numbers = '0123456789'

    for super_, sub_, normal in zip(superscripts, subscripts, normal_numbers):
        text = text.replace(super_, normal).replace(sub_, normal)

    # Fold remaining compatibility characters with NFKD.
    # Note: NFKD also splits accented letters into base letter + combining mark.
    text = unicodedata.normalize('NFKD', text)

    return remove_quotes(text)
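
The final step relies on NFKD, which folds compatibility characters (superscripts, no-break spaces, ligatures) but also splits accented letters into a base letter plus a combining mark. The sketch below shows the effect on a few sample strings and compares it with NFKC, which recomposes the letters afterwards. Whether NFKC would suit this pipeline better is an open choice; the sample strings are illustrative.

```python
import unicodedata

samples = {
    "accented letter": "Polit\u00e8cnica",  # è stored as a single code point
    "superscript": "m\u00b2",               # m²
    "no-break space": "5\u00a0mg",
    "ligature": "\ufb01lm",                 # ﬁlm
}

for label, raw in samples.items():
    nfkd = unicodedata.normalize("NFKD", raw)
    nfkc = unicodedata.normalize("NFKC", raw)
    print(f"{label}: raw len={len(raw)}, NFKD len={len(nfkd)}, NFKC len={len(nfkc)}")
```

For the accented sample, NFKD yields one more code point than the input, while NFKC returns a string of the original length.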


In [3]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from PDF"""
    doc = fitz.open(pdf_path)
    text = ""
    
    for page_num, page in enumerate(doc):
        page_text = page.get_text()
        if page_text:
            text += f"\n--- Page {page_num + 1} ---\n{page_text}"
    doc.close()
    # Normalize once, after all pages are collected, instead of re-normalizing per page
    return normalize_characters(text).strip()

class LlamaExtractor:
    """Metadata extractor based on Llama 3 8B Instruct"""
    
    def __init__(self, model_name: str = "meta-llama/Meta-Llama-3-8B-Instruct"):
        print(f"📥 Loading {model_name}...")
        
        # Load tokenizer with stderr suppression
        with contextlib.redirect_stderr(io.StringIO()):
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Load model with quantization if CUDA available
        if torch.cuda.is_available():
            # 8-bit weights via bitsandbytes (the compute-dtype options apply to 4-bit loading only)
            bnb_config = BitsAndBytesConfig(load_in_8bit=True)
            
            # Load model with stderr suppression
            with contextlib.redirect_stderr(io.StringIO()):
                self.model = AutoModelForCausalLM.from_pretrained(
                    model_name,
                    quantization_config=bnb_config,
                    device_map="auto",
                    torch_dtype=torch.float16
                )
        else:
            # CPU loading
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=torch.float32
            )
            device = torch.device("cpu")
            self.model = self.model.to(device)
        
        self.model.eval()
        print(f"✅ Model loaded successfully")
    
    def extract_field(self, text: str, field: str) -> Any:
        """Extract specific field using optimized prompts for clean output"""
        
        # More explicit prompts that discourage verbose responses
        prompts = {
            'title': f"""Extract only the title from this poster text. Provide just the title text, nothing else.

Text: "{text[:500]}"

Title:""",
            
            'authors': f"""Extract author names and affiliations from this poster. Format as: "Name1 (Institution1) | Name2 (Institution2)" or just names if no affiliations found.

Text: "{text[:600]}"

Authors:""",
            
            'summary': f"""Write a concise 2-sentence summary of this poster's research. Be direct and factual.

Text: "{text[:800]}"

Summary:""",
            
            'keywords': f"""Extract 5-6 key technical terms from this poster. List only the keywords separated by commas.

Text: "{text[:600]}"

Keywords:""",
            
            'methods': f"""Extract the research methods described in this poster. Be concise and specific.

Text: "{text[:800]}"

Methods:""",
            
            'results': f"""Extract the main research findings from this poster. Include specific numbers/measurements if present.

Text: "{text[:800]}"

Results:""",
            
            'references': f"""Extract references or citations from this poster. Format as: "Title1 (Authors, Year, Journal) | Title2 (Authors, Year, Journal)" or "None found" if no references.

Text: "{text[:1000]}"

References:""",
            
            'funding_sources': f"""Extract funding sources, grants, or acknowledgments from this poster. List funding agencies or grant numbers separated by commas, or "None found".

Text: "{text[:800]}"

Funding:""",
            
            'conference_info': f"""Extract conference information from this poster. Format as: "Location: City, Country | Date: date range" or "None found" if not mentioned.

Text: "{text[:600]}"

Conference:"""
        }
        
        if field not in prompts:
            return ""
        
        prompt = prompts[field]
        
        # Create chat template with explicit instructions for conciseness
        messages = [
            {"role": "system", "content": "You are a precise data extraction assistant. Provide only the requested information without explanatory text, prefixes, or formatting. Be direct and concise."},
            {"role": "user", "content": prompt}
        ]
        
        # Apply chat template
        text_input = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        
        # Tokenize
        inputs = self.tokenizer(
            text_input,
            return_tensors="pt",
            truncation=True,
            max_length=1024
        ).to(self.model.device)
        
        # Generate with greedy decoding for deterministic output
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=150,
                do_sample=False,     # Greedy decoding = deterministic (most probable token)
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
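                repetition_penalty=1.1  # Discourage repeated tokens
                # Note: temperature/top_p are ignored when do_sample=False. The checkpoint's
                # generation config still sets them, so transformers logs a warning per call.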
            )
        
        # Decode response
        response = self.tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        
        # Clean up the response
        response = clean_llama_response(response, field)
        
        # Parse response based on field
        if field == "authors":
            authors = []
            # Handle both simple names and "Name (Institution)" format
            if "|" in response:
                # Format: "Name1 (Institution1) | Name2 (Institution2)"
                author_parts = response.split("|")
            else:
                # Format: "Name1, Name2, Name3" or "Name1 (Inst1), Name2 (Inst2)"
                author_parts = response.split(",")
            
            for author_part in author_parts:
                author_part = author_part.strip()
                if not author_part:
                    continue
                    
                # Check if has institution in parentheses
                if "(" in author_part and ")" in author_part:
                    name_part = author_part.split("(")[0].strip()
                    affil_part = author_part.split("(")[1].split(")")[0].strip()
                    authors.append({
                        "name": name_part,
                        "affiliations": [affil_part],
                        "email": None
                    })
                else:
                    authors.append({
                        "name": author_part,
                        "affiliations": [],
                        "email": None
                    })
            return authors[:6]  # Limit to 6
            
        elif field == "keywords":
            keywords = [k.strip() for k in response.split(",") if k.strip()]
            return keywords[:8]  # Limit to 8
            
        elif field == "references":
            if response.lower() == "none found" or not response.strip():
                return []
            
            references = []
            # Handle format: "Title1 (Authors, Year, Journal) | Title2 (Authors, Year, Journal)"
            if "|" in response:
                ref_parts = response.split("|")
            else:
                ref_parts = [response]
                
            for ref_part in ref_parts:
                ref_part = ref_part.strip()
                if not ref_part or ref_part.lower() == "none found":
                    continue
                    
                # Try to parse "Title (Authors, Year, Journal)" format
                if "(" in ref_part and ")" in ref_part:
                    title = ref_part.split("(")[0].strip()
                    details = ref_part.split("(")[1].split(")")[0].strip()
                    
                    # Split details by comma and try to extract year
                    detail_parts = [d.strip() for d in details.split(",")]
                    year = None
                    journal = ""
                    authors = ""
                    
                    for part in detail_parts:
                        if part.isdigit() and len(part) == 4:  # Likely a year
                            year = int(part)
                        elif year is None:
                            authors = part if not authors else authors + ", " + part
                        else:
                            journal = part if not journal else journal + ", " + part
                    
                    references.append({
                        "title": title,
                        "authors": authors,
                        "year": year,
                        "journal": journal
                    })
                else:
                    # Fallback: treat whole thing as title
                    references.append({
                        "title": ref_part,
                        "authors": "",
                        "year": None,
                        "journal": ""
                    })
            return references[:5]  # Limit to 5
            
        elif field == "funding_sources":
            if response.lower() == "none found" or not response.strip():
                return []
            
            funding = [f.strip() for f in response.split(",") if f.strip() and f.strip().lower() != "none found"]
            return funding[:5]  # Limit to 5
            
        elif field == "conference_info":
            if response.lower() == "none found" or not response.strip():
                return {"location": None, "date": None}
            
            location = None
            date = None
            
            # Handle format: "Location: City, Country | Date: date range"
            if "|" in response:
                parts = response.split("|")
                for part in parts:
                    part = part.strip()
                    if part.lower().startswith("location:"):
                        location = part.split(":", 1)[1].strip()
                    elif part.lower().startswith("date:"):
                        date = part.split(":", 1)[1].strip()
            else:
                # Try to detect location/date in single string
                if any(word in response.lower() for word in ["location", "city", "country"]):
                    location = response.strip()
                elif any(word in response.lower() for word in ["date", "may", "june", "july", "august", "september"]):
                    date = response.strip()
                else:
                    # Assume it's location if no clear indicator
                    location = response.strip()
            
            return {"location": location, "date": date}
        else:
            return response

print("✅ LlamaExtractor class defined")

✅ LlamaExtractor class defined
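
The author parser above splits the response on `|` or `,` before it looks at parentheses, so an affiliation that itself contains one of those delimiters is cut in two. A possible alternative, sketched below, splits only on delimiters that sit outside parentheses. `split_outside_parens` is a hypothetical helper and is not part of the class; the example string mirrors the prompt's "Name (Institution)" format.

```python
from typing import List


def split_outside_parens(text: str, delimiters: str = "|,") -> List[str]:
    """Split text on delimiter characters, ignoring any inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch in delimiters and depth == 0:
            # Delimiter at top level: close the current entry
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


response = "A. Author (Univ X | Dept Y, Univ Z) | B. Author (Univ X)"
for entry in split_outside_parens(response):
    print(entry)
```

Each returned entry keeps its full parenthesized affiliation, so the existing name/affiliation split can run on it unchanged.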


In [4]:
# Run extraction
pdf_path = "../data/test-poster.pdf"

if Path(pdf_path).exists():
    print("🚀 Running Method 2a: Llama 3.2 8B Local Extraction")
    print("=" * 60)
    
    start_time = time.time()
    
    # Extract text
    text = extract_text_from_pdf(pdf_path)
    print(f"📏 Extracted {len(text)} characters")
    
    try:
        # Initialize extractor
        print("🤖 Initializing Llama 3.2 8B model...")
        extractor = LlamaExtractor()
        
        # Extract each field
        print("🔍 Extracting metadata components...")
        
        title = extractor.extract_field(text, "title")
        authors = extractor.extract_field(text, "authors")
        summary = extractor.extract_field(text, "summary")
        keywords = extractor.extract_field(text, "keywords")
        methods = extractor.extract_field(text, "methods")
        results_text = extractor.extract_field(text, "results")
        references = extractor.extract_field(text, "references")
        funding_sources = extractor.extract_field(text, "funding_sources")
        conference_info = extractor.extract_field(text, "conference_info")
        
        # Compile results
        results = {
            "title": title,
            "authors": authors,
            "summary": summary,
            "keywords": keywords,
            "methods": methods,
            "results": results_text,
            "references": references,
            "funding_sources": funding_sources,
            "conference_info": conference_info,
            "extraction_metadata": {
                "timestamp": datetime.now().isoformat(),
                "processing_time": time.time() - start_time,
                "method": "llama_local",
                "model": "Meta-Llama-3-8B-Instruct",
                "device": str(next(extractor.model.parameters()).device),
                "text_length": len(text),
                "do_sample": False,
                "max_tokens": 150
            }
        }
        
        # Display results
        print(f"\n📄 TITLE: {results['title'][:100]}")
        print(f"\n👥 AUTHORS: {len(results['authors'])} found")
        for author in results["authors"]:
            affil_str = f" ({', '.join(author['affiliations'])})" if author['affiliations'] else ""
            print(f"   • {author['name']}{affil_str}")
        
        print(f"\n📝 SUMMARY: {results['summary'][:100]}...")
        print(f"\n🔑 KEYWORDS: {', '.join(results['keywords'][:5])}")
        print(f"\n🔬 METHODS: {results['methods'][:100]}...")
        print(f"\n📊 RESULTS: {results['results'][:100]}...")
        print(f"\n📚 REFERENCES: {len(results['references'])} found")
        for ref in results['references'][:2]:  # Show first 2
            print(f"   • {ref['title'][:50]}...")
        print(f"\n💰 FUNDING: {len(results['funding_sources'])} sources")
        for funding in results['funding_sources'][:2]:  # Show first 2
            print(f"   • {funding[:50]}...")
        print(f"\n🏛️  CONFERENCE: {results['conference_info']['location']} | {results['conference_info']['date']}")
        print(f"⏱️  Processing time: {results['extraction_metadata']['processing_time']:.2f}s")
        
        # Save results
        output_path = Path("../output/method2a_llama_results.json")
        output_path.parent.mkdir(parents=True, exist_ok=True)
        
        with open(output_path, "w") as f:
            json.dump(results, f, indent=2)
        
        print(f"💾 Results saved to: {output_path}")
        print("✅ Method 2a (Llama) completed successfully!")
        
    except Exception as e:
        print(f"❌ Llama extraction failed: {e}")
        print("   This may be due to insufficient GPU memory or model download issues")
        
else:
    print("❌ Test poster not found")

🚀 Running Method 2a: Llama 3.2 8B Local Extraction
📏 Extracted 3732 characters
🤖 Initializing Llama 3.2 8B model...
📥 Loading meta-llama/Meta-Llama-3-8B-Instruct...


E0000 00:00:1756517514.853929 2450263 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756517514.859618 2450263 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1756517514.875174 2450263 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


✅ Model loaded successfully
🔍 Extracting metadata components...



📄 TITLE: INFLUENCE OF DRUG-POLYMER INTERACTIONS ON RELEASE KINETICS OF PLGA AND PLA/PEG NPS

👥 AUTHORS: 6 found
   • Merve Gul (University of Pavia
   • Department of Chemical Engineering, Universitat Politècnica de Catalunya)
   • Ida Genta (University of Pavia)
   • Maria M. Perez Madrigal (Universitat Politècnica de Catalunya)
   • Carlos Aleman (Universitat Politècnica de Catalunya
   • Barcelona Research Center for Multiscale Science and Engineering)

📝 SUMMARY: The study investigates the influence of drug-polymer interactions on release kinetics of poly(lactic...

🔑 KEYWORDS: PLGA, PLA, PEG, NPS, AMR

🔬 METHODS: • Microfluidic-based synthesis of nano-sized carriers for drug delivery systems (NDDS)...

📊 RESULTS: • The release kinetics of PLGA and PLA/PEG NPs were influenced by drug-polymer interactions.
• The e...

📚 REFERENCES: 2 found
   • ...
   • ...

💰 FUNDING: 1 sources
   • None found....

🏛️  CONFERENCE: None | None
⏱️  Processing time: 35.92s
💾 Results saved to: ../o
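
The run above reports one funding source whose value is `None found.`: the trailing period defeats the exact comparison with `"none found"` in the parser, so the sentinel is stored as data. A more tolerant check might look like the sketch below. `is_none_found` is a hypothetical helper, not part of the notebook.

```python
import string


def is_none_found(response: str) -> bool:
    """True if the response is empty or is the 'None found' sentinel.

    Case, surrounding quotes, whitespace, and trailing punctuation are ignored.
    """
    cleaned = response.strip().strip(string.punctuation + " ").lower()
    return cleaned in {"", "none found"}


for candidate in ["None found.", '"None found"', "NIH grant R01"]:
    print(f"{candidate!r}: {is_none_found(candidate)}")
```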