# PaperFlow: Complete Academic Paper Processing Demo

PaperFlow is a unified pipeline for academic paper ingestion, extraction, and RAG (Retrieval-Augmented Generation). It allows you to:

- **Search** across multiple academic sources (arXiv, PubMed, Semantic Scholar, OpenAlex)
- **Download** PDFs automatically
- **Extract** text and structure from PDFs
- **Chunk** content for RAG applications
- **Embed** and store in vector databases
- **Query** papers using natural language

## Features

- 🔍 Multi-source search (arXiv, PubMed, Semantic Scholar, OpenAlex)
- 📥 Automatic PDF downloading
- 📄 Advanced PDF text extraction (Marker AI, Docling, MarkItDown)
- ✂️ Intelligent text chunking
- 🧠 Vector embeddings for RAG
- 💾 ChromaDB integration
- 🔗 LangChain compatibility
- 📊 Tabular result display
- 💻 Command-line interface

## Installation

Remove any previously installed copy first, then install the latest version of PaperFlow with all optional dependencies.

In [3]:
!pip uninstall paperflow -y



In [None]:
!pip install "paperflow[extraction-all,rag,providers]" --upgrade



## Demo: All Academic Providers

This demo shows how to use **all academic paper providers** in PaperFlow:
- 🔍 **arXiv** - Preprints and technical papers
- 🏥 **PubMed** - Biomedical and life sciences research
- 📚 **Semantic Scholar** - AI-powered academic search
- 🌐 **OpenAlex** - Open catalog of scholarly works

The demo covers:
1. Search across all providers
2. Display results in formatted tables and JSON
3. Download PDFs from multiple sources
4. Extract text and structure from PDFs (with optional GPU acceleration)
5. Prepare content for RAG applications

The pipeline separates download and extraction phases for better control and efficiency.
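
As a rough illustration of why separating the two phases helps: downloads are network-bound and can be cached or retried on their own, while extraction is the slow, compute-bound step. The sketch below is not PaperFlow code; `download_phase`, `extraction_phase`, `fetch_pdf`, and `extract_text` are hypothetical names used only to show the idea.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def download_phase(ids: Iterable[str],
                   fetch_pdf: Callable[[str], bytes],
                   pdf_dir: Path) -> List[Path]:
    """Download every PDF first, skipping files that already exist on disk."""
    pdf_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for paper_id in ids:
        path = pdf_dir / f"{paper_id}.pdf"
        if not path.exists():  # a cached download is not fetched again
            path.write_bytes(fetch_pdf(paper_id))
        paths.append(path)
    return paths


def extraction_phase(paths: Iterable[Path],
                     extract_text: Callable[[Path], str]) -> Dict[str, str]:
    """Run the (slow) extractor only after all downloads have finished."""
    return {path.stem: extract_text(path) for path in paths}
```

Because the download phase is idempotent, an interrupted extraction run can be restarted without fetching the PDFs again.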

## Configuration Options

**GPU Support**: If you have a CUDA-compatible GPU, you can enable GPU acceleration for faster PDF extraction:

```python
USE_GPU = True  # Enable GPU acceleration
```
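
Rather than hard-coding the flag, it can be derived from the environment. The sketch below assumes the extraction backends rely on PyTorch for GPU work, which this notebook does not state; `cuda_available` is a hypothetical helper, while `torch.cuda.is_available()` is the standard PyTorch check.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch not installed, so no GPU path (assumption)
    import torch
    return bool(torch.cuda.is_available())


USE_GPU = cuda_available()
print(f"GPU acceleration: {'Enabled' if USE_GPU else 'Disabled'}")
```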

**Custom PDF Directory**: Specify where to save downloaded PDFs:

```python
PDF_DIR = './my_papers'  # Custom directory for PDFs
```

**PDF Extraction Backend**: Choose the best extraction method for your needs:

```python
# Options: "auto", "marker", "docling", "markitdown"
EXTRACTION_BACKEND = "auto"  # Auto-select: marker → docling → markitdown
# EXTRACTION_BACKEND = "marker"      # High quality, best for academic papers
# EXTRACTION_BACKEND = "docling"     # Good table/figure extraction  
# EXTRACTION_BACKEND = "markitdown"  # Lightweight, fast, CPU only
```
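
The `"auto"` fallback order can be pictured as a first-available search over the backends. This is a minimal sketch of that idea, not PaperFlow's actual selection logic; it assumes each backend is importable under a module of the same name, and `pick_backend` and `is_installed` are hypothetical helpers.

```python
import importlib.util
from typing import Callable, Optional, Sequence

# Preference order taken from the "auto" comment above.
FALLBACK_ORDER = ("marker", "docling", "markitdown")


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(order: Sequence[str] = FALLBACK_ORDER,
                 available: Callable[[str], bool] = is_installed) -> Optional[str]:
    """Return the first backend in `order` that is available, else None."""
    for name in order:
        if available(name):
            return name
    return None


print(f"Backend that 'auto' would pick here: {pick_backend()}")
```

Passing `available` as a parameter keeps the selection logic testable without installing any of the backends.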

**Usage**:
```python
pipeline = PaperPipeline(
    gpu=USE_GPU, 
    pdf_dir=PDF_DIR,
    extraction_backend=EXTRACTION_BACKEND
)
```

In [None]:
from paperflow import PaperPipeline

# Configuration
USE_GPU = False  # Set to True if you have CUDA GPU and want faster extraction
PDF_DIR = './test_pdfs'

# Choose PDF extraction backend:
# - "auto": Try marker → docling → markitdown (recommended)
# - "marker": High quality, best for academic papers, GPU support
# - "docling": Good table/figure extraction, IBM, GPU support  
# - "markitdown": Lightweight, fast, CPU only, Microsoft
EXTRACTION_BACKEND = "auto"

print(f"GPU acceleration: {'Enabled' if USE_GPU else 'Disabled'}")
print(f"PDF directory: {PDF_DIR}")
print(f"Extraction backend: {EXTRACTION_BACKEND}")

⏳ Loading Marker AI models...
✅ Marker AI loaded
Searching for papers on transformers...
Found 3 papers in 551ms
Sources: ['arxiv']

+-----+----------------------------------------+------------------------------+--------+----------+--------------+
|   # | Title                                  | Authors                      |   Year | Source   | Link/ID      |
|   1 | Dilated Neighborhood Attention         | Ali Hassani, Humphrey Shi    |   2022 | arxiv    | 2209.15001v3 |
|     | Transformer                            |                              |        |          |              |
+-----+----------------------------------------+------------------------------+--------+----------+--------------+
|   2 | Mask-Attention-Free Transformer for 3D | Xin Lai, Yuhui Yuan, Ruihang |   2023 | arxiv    | 2309.01692v1 |
|     | Instance Segmentation                  | Chu et al.                   |        |          |              |
+-----+----------------------------------------+----------


In [None]:
from paperflow import PaperPipeline

# Create pipeline with selected backend
pipeline = PaperPipeline(
    gpu=USE_GPU, 
    pdf_dir=PDF_DIR,
    extraction_backend=EXTRACTION_BACKEND
)

print(f"✅ Pipeline created with {pipeline._extractor.active_backend} backend")

## 1. arXiv Provider Demo

Search for computer science papers on arXiv, the preprint server for physics, mathematics, computer science, and more.

In [None]:
# Search arXiv
query = 'transformer attention mechanism'
print(f'🔍 Searching arXiv for "{query}" papers...')
results = pipeline.search(query, sources=['arxiv'], max_results=3)

print(f"✅ Found {results.total_found} papers in {results.search_time_ms}ms")

In [None]:
# Display arXiv results
if results.papers:
    print("\n📋 arXiv Search Results:")
    print("-" * 80)
    
    for i, paper in enumerate(results.papers, 1):
        title = paper['title'][:70] + "..." if len(paper['title']) > 70 else paper['title']
        authors = ", ".join([a['name'] for a in paper.get('authors', [])[:2]])
        if len(paper.get('authors', [])) > 2:
            authors += " et al."
        
        print(f"{i}. {title}")
        print(f"   Authors: {authors}")
        print(f"   Year: {paper.get('year', 'N/A')}")
        print(f"   arXiv ID: {paper.get('arxiv_id', 'N/A')}")
        print(f"   Citations: {paper.get('citation_count', 0)}")
        print()
else:
    print("❌ No papers found")
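
The title truncation and author shortening above recur in each provider's display cell. One way to avoid the repetition is a small formatting helper. This is a sketch only: the helper names are hypothetical, and the dictionary keys (`title`, `authors`, `year`, `doi`, `arxiv_id`, `citation_count`) are the ones used in the display cells of this notebook.

```python
from typing import Any, Dict, List


def short_title(title: str, limit: int = 70) -> str:
    """Truncate a title to `limit` characters, marking the cut with '...'."""
    return title[:limit] + "..." if len(title) > limit else title


def short_authors(authors: List[Dict[str, Any]], limit: int = 2) -> str:
    """Join the first `limit` author names and append 'et al.' if more exist."""
    names = ", ".join(a["name"] for a in authors[:limit])
    return names + " et al." if len(authors) > limit else names


def format_paper(index: int, paper: Dict[str, Any], id_field: str = "doi") -> str:
    """Render one search result as the multi-line block used in this notebook."""
    lines = [
        f"{index}. {short_title(paper['title'])}",
        f"   Authors: {short_authors(paper.get('authors', []))}",
        f"   Year: {paper.get('year', 'N/A')}",
        f"   {id_field}: {paper.get(id_field, 'N/A')}",
        f"   Citations: {paper.get('citation_count', 0)}",
    ]
    return "\n".join(lines)
```

With this in place, a display cell reduces to `for i, paper in enumerate(results.papers, 1): print(format_paper(i, paper, id_field="arxiv_id"))`.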

In [None]:
# Display arXiv results in JSON format
import json

if results.papers:
    print("📋 arXiv Search Results in JSON:")
    print(json.dumps(results.papers, indent=2))

## 2. PubMed Provider Demo

Search for biomedical research papers in PubMed, the premier database for biomedical literature.

In [None]:
# Search PubMed
query = 'CRISPR gene editing therapy'
print(f'🔍 Searching PubMed for "{query}" papers...')
pubmed_results = pipeline.search(query, sources=['pubmed'], max_results=3)

print(f"✅ Found {pubmed_results.total_found} papers in {pubmed_results.search_time_ms}ms")

In [None]:
# Display PubMed results
if pubmed_results.papers:
    print("\n📋 PubMed Search Results:")
    print("-" * 80)
    
    for i, paper in enumerate(pubmed_results.papers, 1):
        title = paper['title'][:70] + "..." if len(paper['title']) > 70 else paper['title']
        authors = ", ".join([a['name'] for a in paper.get('authors', [])[:2]])
        if len(paper.get('authors', [])) > 2:
            authors += " et al."
        
        print(f"{i}. {title}")
        print(f"   Authors: {authors}")
        print(f"   Year: {paper.get('year', 'N/A')}")
        print(f"   DOI: {paper.get('doi', 'N/A')}")
        print(f"   Citations: {paper.get('citation_count', 0)}")
        print()

In [None]:
# Display PubMed results in JSON format
import json

if pubmed_results.papers:
    print("📋 PubMed Search Results in JSON:")
    print(json.dumps(pubmed_results.papers, indent=2))

## 3. Semantic Scholar Provider Demo

Search using Semantic Scholar's AI-powered academic search engine.

In [None]:
# Search Semantic Scholar
query = 'large language models GPT'
print(f'🔍 Searching Semantic Scholar for "{query}"...')
sem_results = pipeline.search(query, sources=['semantic_scholar'], max_results=3)

print(f"✅ Found {sem_results.total_found} papers in {sem_results.search_time_ms}ms")

In [None]:
# Display Semantic Scholar results
if sem_results.papers:
    print("\n📋 Semantic Scholar Results:")
    print("-" * 80)
    
    for i, paper in enumerate(sem_results.papers, 1):
        title = paper['title'][:70] + "..." if len(paper['title']) > 70 else paper['title']
        authors = ", ".join([a['name'] for a in paper.get('authors', [])[:2]])
        if len(paper.get('authors', [])) > 2:
            authors += " et al."
        
        print(f"{i}. {title}")
        print(f"   Authors: {authors}")
        print(f"   Year: {paper.get('year', 'N/A')}")
        print(f"   DOI: {paper.get('doi', 'N/A')}")
        print(f"   Citations: {paper.get('citation_count', 0)}")
        print()

In [None]:
# Display Semantic Scholar results in JSON format
import json

if sem_results.papers:
    print("📋 Semantic Scholar Search Results in JSON:")
    print(json.dumps(sem_results.papers, indent=2))

## 4. OpenAlex Provider Demo

Search the OpenAlex catalog, which covers millions of scholarly works from all disciplines.

In [None]:
# Search OpenAlex
query = 'climate change adaptation strategies'
print(f'🔍 Searching OpenAlex for "{query}"...')
openalex_results = pipeline.search(query, sources=['openalex'], max_results=3)

print(f"✅ Found {openalex_results.total_found} papers in {openalex_results.search_time_ms}ms")

In [None]:
# Display OpenAlex results
if openalex_results.papers:
    print("\n📋 OpenAlex Results:")
    print("-" * 80)
    
    for i, paper in enumerate(openalex_results.papers, 1):
        title = paper['title'][:70] + "..." if len(paper['title']) > 70 else paper['title']
        authors = ", ".join([a['name'] for a in paper.get('authors', [])[:2]])
        if len(paper.get('authors', [])) > 2:
            authors += " et al."
        
        print(f"{i}. {title}")
        print(f"   Authors: {authors}")
        print(f"   Year: {paper.get('year', 'N/A')}")
        print(f"   DOI: {paper.get('doi', 'N/A')}")
        print(f"   Citations: {paper.get('citation_count', 0)}")
        print()

In [None]:
# Display OpenAlex results in JSON format
import json

if openalex_results.papers:
    print("📋 OpenAlex Search Results in JSON:")
    print(json.dumps(openalex_results.papers, indent=2))
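
When the same query is sent to several providers, the same paper can come back more than once. The sketch below shows one simple way to merge result lists, assuming each paper is a dictionary with the `doi` and `title` keys used above. `paper_key` and `merge_results` are hypothetical helpers, not part of PaperFlow, and the DOI-then-title rule is only a heuristic.

```python
import re
from typing import Any, Dict, Iterable, List


def paper_key(paper: Dict[str, Any]) -> str:
    """Prefer the DOI as identity; fall back to a normalized title."""
    doi = (paper.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", (paper.get("title") or "").lower()).strip()
    return "title:" + title


def merge_results(*result_lists: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Concatenate result lists, keeping the first occurrence of each paper."""
    seen = set()
    merged = []
    for papers in result_lists:
        for paper in papers:
            key = paper_key(paper)
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged
```

For example, `merge_results(results.papers, pubmed_results.papers, sem_results.papers, openalex_results.papers)` would yield one combined list. A paper that carries a DOI in one source but not in another is not caught by this heuristic.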

## 5. Complete Pipeline Demo

Demonstrate the full pipeline: search → download → extract → chunk → embed.

In [None]:
# Full pipeline demonstration
print("🚀 Running complete pipeline...")

# 1. Search
print("1. Searching for papers...")
search_results = pipeline.search("neural networks", sources=["arxiv"], max_results=1)

if search_results.papers:
    paper_dict = search_results.papers[0]
    
    # 2. Download
    print("2. Downloading PDF...")
    paper = pipeline.download(paper_dict)
    
    # 3. Extract
    print("3. Extracting content...")
    paper = pipeline.extract(paper)
    
    # 4. Chunk
    print("4. Creating chunks...")
    paper = pipeline.chunk(paper)
    
    # 5. Embed (requires the optional RAG dependencies)
    print("5. Creating embeddings...")
    try:
        paper = pipeline.embed(paper)
        print("✅ Pipeline completed successfully!")
    except Exception as e:
        # Embedding is optional: report the failure and keep the extracted content
        print(f"⚠️ Embedding failed (possibly missing dependencies): {e}")
        print("✅ Pipeline completed (without embeddings)")

    # Show results, whether or not embedding succeeded
    print("\n📊 Results:")
    print(f"Title: {paper.metadata.title[:50]}...")
    print(f"Sections: {len(paper.sections)}")
    print(f"Chunks: {len(paper.chunks)}")
    print(f"Has embeddings: {paper.has_embeddings}")
else:
    print("❌ No papers found for pipeline demo")
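
Step 4 leaves chunking to the pipeline. As a rough picture of what a chunker does, the sketch below splits text into fixed-size word windows that overlap, so a sentence cut at a boundary still appears whole in one of the chunks. This is a deliberately simple stand-in, not PaperFlow's chunking algorithm; `chunk_words` and its parameters are hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split `text` into chunks of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

Larger overlaps improve recall at chunk boundaries at the cost of storing and embedding more text.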

## Summary

This notebook demonstrated:

✅ **All 4 academic providers**: arXiv, PubMed, Semantic Scholar, OpenAlex  
✅ **Search across all sources** with formatted and JSON output  
✅ **Complete processing pipeline**: search → download → extract → chunk → embed  
✅ **PDF extraction backends**: Marker AI, Docling, MarkItDown with auto-fallback  
✅ **GPU acceleration** support for faster processing  

### Key Features Used:
- Unified search interface across all providers
- Automatic PDF downloading and text extraction
- Intelligent text chunking for RAG
- Vector embeddings for semantic search
- Tabular and JSON result display
- Error handling and graceful fallbacks

Happy researching! 🔬📚


## What Happens Next?

After processing, you can:

- **Query the papers**: Use RAG to ask questions about the content
- **Export to LangChain**: Get LangChain documents for further processing
- **Save to vector database**: Store embeddings for semantic search
- **Analyze content**: Access extracted sections, chunks, and metadata
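
Semantic search over stored embeddings comes down to ranking chunks by vector similarity to the query. The sketch below shows that ranking with cosine similarity on plain Python lists. It is an illustration only: a vector database such as ChromaDB performs this step internally, and `cosine`, `top_k`, and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the `k` chunk ids most similar to `query_vec`, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy two-dimensional "embeddings" keyed by chunk id.
toy_index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_index, k=2))
```

In a RAG setup, the text of the top-ranked chunks is then passed to the language model as context for the question.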

## Command Line Usage

You can also use paperflow from the command line:

```bash
# Install with CLI support
pip install paperflow

# Search and display results
paperflow "transformer attention" --sources arxiv --max-results 5
```
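
For readers curious how flags like these are typically parsed, here is a small `argparse` sketch that accepts the same arguments as the command above. It is not PaperFlow's CLI implementation; the program name `paperflow-sketch` and the default values are assumptions made for illustration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags shown in the command above."""
    parser = argparse.ArgumentParser(
        prog="paperflow-sketch",  # hypothetical name, not the real entry point
        description="Search academic sources (illustrative sketch).",
    )
    parser.add_argument("query", help="search terms, quoted if they contain spaces")
    parser.add_argument("--sources", nargs="+", default=["arxiv"],
                        help="one or more providers (default is an assumption)")
    parser.add_argument("--max-results", type=int, default=5,
                        help="maximum number of results (default is an assumption)")
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or the process arguments when `argv` is None."""
    return build_parser().parse_args(argv)


args = parse(["transformer attention", "--sources", "arxiv", "--max-results", "5"])
print(args.query, args.sources, args.max_results)
```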

## Advanced Usage

For more advanced features, install optional dependencies:

```bash
# Quote the extras so shells such as zsh do not expand the brackets
pip install "paperflow[all]"  # Full installation
pip install "paperflow[extraction]"  # PDF extraction only
pip install "paperflow[rag]"  # RAG features only
```

Check out the [documentation](https://github.com/osllmai/paperflow) for more examples!

In [None]:
from paperflow import PaperPipeline

# Configuration
USE_GPU = False  # Set to True if you have CUDA GPU and want faster extraction
PDF_DIR = './test_pdfs'

# Create pipeline
pipeline = PaperPipeline(gpu=USE_GPU, pdf_dir=PDF_DIR)

# Search for papers
print('Searching for papers on transformers...')
results = pipeline.search('transformer attention', sources=['arxiv'], max_results=3)

# Display results
print(results)
print()

# Process all papers
print('Processing all papers...')
for i, paper_meta in enumerate(results.papers, 1):
    print(f'Processing paper {i}: {paper_meta["title"][:50]}...')
    paper = pipeline.process(paper_meta)
    print(f'  - PDF saved: {paper.pdf_path}')
    print(f'  - Sections: {len(paper.sections)}, Chunks: {len(paper.chunks)}')
    print()

print('All done!')
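
In the loop above, a single failed download or extraction stops the whole batch. A sketch of a more forgiving variant collects failures and carries on. `process_all` is a hypothetical helper; the `process` argument stands in for a callable such as `pipeline.process`.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def process_all(papers: Iterable[Dict[str, Any]],
                process: Callable[[Dict[str, Any]], Any]
                ) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Process each paper, recording failures instead of aborting the batch."""
    done: List[Any] = []
    failed: List[Tuple[str, str]] = []
    for meta in papers:
        try:
            done.append(process(meta))
        except Exception as exc:  # keep going; report the error afterwards
            failed.append((meta.get("title", "<untitled>"), str(exc)))
    return done, failed
```

Usage would look like `done, failed = process_all(results.papers, pipeline.process)`, followed by printing each entry in `failed` so that problem papers can be retried separately.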