
In [4]:
import os
from pathlib import Path

# standard path
cache_path = Path.home() / ".cache" / "huggingface" / "hub"

if cache_path.exists():
    print("Found these model folders:")
    for item in os.listdir(cache_path):
        print(f"- {item}")
else:
    print(f"Directory not found at {cache_path}")

Found these model folders:
- .locks
- models--bert-base-uncased
- models--ds4sd--docling-layout-old
- models--ds4sd--docling-models
- models--facebook--sam-3d-body-dinov3
- models--jetjodh--sam-3d-body-dinov3
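
The folder names in the listing above follow the Hugging Face hub cache layout, where a repository is stored as `models--<org>--<name>` (or `models--<name>` when there is no organisation). As a minimal sketch, the hypothetical helper `folder_to_repo_id` below (not part of any library) turns such a folder name back into a repository id. It assumes that `--` only ever appears as the separator, which may not hold for every repository name.

```python
def folder_to_repo_id(folder_name):
    """Map a hub cache folder name to (repo_type, repo_id); return None for other entries."""
    parts = folder_name.split("--")
    # Entries such as ".locks" are bookkeeping folders, not repositories
    if len(parts) < 2 or parts[0] not in {"models", "datasets", "spaces"}:
        return None
    # Assumption: every remaining "--" separates path components of the repo id
    return parts[0], "/".join(parts[1:])


listing = [
    ".locks",
    "models--bert-base-uncased",
    "models--ds4sd--docling-models",
]
for name in listing:
    print(f"{name} -> {folder_to_repo_id(name)}")
```

Filtering on the `None` result is one way to hide `.locks` from the printed list.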


In [6]:
!pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting ipywidgets
  Downloading ipywidgets-8.1.8-py3-none-any.whl.metadata (2.4 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.15-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl.metadata (20 kB)
Downloading ipywidgets-8.1.8-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl (914 kB)
Downloading widgetsnbextension-4.0.15-py3-none-any.whl (2.2 MB)
Installing collected packages: widgetsnbextension, jupyterlab_widgets, ipywidgets


In [2]:
!pip install marker-pdf arxiv pymed requests torch biopython

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [7]:
# --- MARKER AI SETUP ---
import sys
import traceback

print("⏳ Loading Marker AI models...")

try:
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(
        artifact_dict=create_model_dict(),
    )
    print("✅ Marker models loaded.")

except ImportError as e:
    print("❌ ImportError while importing Marker.")
    print(f"   Python: {sys.executable}")
    print(f"   Error : {e}")
    print("\n✅ Fix:")
    print("   1) Activate your env:  conda activate article")
    print("   2) Install correct pkg: python -m pip install -U marker-pdf")
    print("\n🔎 Full traceback:")
    traceback.print_exc()
    raise  # fail loudly so you see the real reason

except Exception:
    print("❌ Model Load Error (non-import).")
    print(f"   Python: {sys.executable}")
    print("\n🔎 Full traceback:")
    traceback.print_exc()
    raise


⏳ Loading Marker AI models...
✅ Marker models loaded.


In [14]:
from marker.models import create_model_dict

model_dict = create_model_dict()

print("🔍 Scanning loaded models for file paths...\n")

for key, model_obj in model_dict.items():
    # Try to find the path in different possible locations within the object
    path = "Path not found"
    
    # Check if it has a model attribute with a config
    if hasattr(model_obj, 'model') and hasattr(model_obj.model, 'config'):
        path = getattr(model_obj.model.config, '_name_or_path', "Path not found")
    
    # Check if the object itself has a config (common in some Marker versions)
    elif hasattr(model_obj, 'config'):
        path = getattr(model_obj.config, '_name_or_path', "Path not found")

    print(f"📦 Model Key: {key}")
    print(f"📂 Location:  {path}\n")

🔍 Scanning loaded models for file paths...

📦 Model Key: layout_model
📂 Location:  Path not found

📦 Model Key: recognition_model
📂 Location:  Path not found

📦 Model Key: table_rec_model
📂 Location:  C:\Users\llmserver\AppData\Local\datalab\datalab\Cache\models\table_recognition/2025_02_18

📦 Model Key: detection_model
📂 Location:  C:\Users\llmserver\AppData\Local\datalab\datalab\Cache\models\text_detection/2025_05_07

📦 Model Key: ocr_error_model
📂 Location:  C:\Users\llmserver\AppData\Local\datalab\datalab\Cache\models\ocr_error_detection/2025_02_18



In [4]:

import os
import time
import requests
import arxiv
from Bio import Entrez
from datetime import datetime, timedelta
import uuid
import json
import re


In [5]:

# ---------------------------------------------------------
# CONFIGURATION
# ---------------------------------------------------------
MAX_RESULTS = 50
DOWNLOAD_DELAY = 2
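# Read credentials from the environment instead of hard-coding them in the notebook
NCBI_API_KEY = os.environ.get("NCBI_API_KEY")  # None if unset; Entrez then runs without a key
NCBI_EMAIL = os.environ.get("NCBI_EMAIL")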

# Folder structure
PDF_FOLDER = "papers_pdf"
MARKDOWN_FOLDER = "papers_markdown"
METADATA_FILE = "papers_metadata.json"

# arXiv Categories for filtering
ARXIV_CATEGORIES = {
    'cs.AI': 'Artificial Intelligence',
    'cs.LG': 'Machine Learning',
    'cs.CV': 'Computer Vision',
    'cs.CL': 'Computation and Language',
    'cs.NE': 'Neural and Evolutionary Computing',
    'q-bio.GN': 'Genomics',
    'q-bio.QM': 'Quantitative Methods',
    'physics.bio-ph': 'Biological Physics',
    'stat.ML': 'Machine Learning (Statistics)',
    'math.ST': 'Statistics Theory'
}

In [6]:

# ---------------------------------------------------------
# METADATA DATABASE
# ---------------------------------------------------------
class PaperDatabase:
    """Manages paper metadata with UUIDs and citations."""
    
    def __init__(self, db_file=METADATA_FILE):
        self.db_file = db_file
        self.papers = self.load()
    
    def load(self):
        """Load existing metadata database."""
        if os.path.exists(self.db_file):
            with open(self.db_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        return {}
    
    def save(self):
        """Save metadata database."""
        with open(self.db_file, 'w', encoding='utf-8') as f:
            json.dump(self.papers, f, indent=2, ensure_ascii=False)
    
    def add_paper(self, metadata):
        """Add a paper with UUID and citation."""
        paper_uuid = str(uuid.uuid4())
        
        # Generate citation
        citation = self.generate_citation(metadata)
        
        # Store complete metadata
        self.papers[paper_uuid] = {
            'uuid': paper_uuid,
            'title': metadata['title'],
            'authors': metadata['authors'],
            'year': metadata.get('year', 'N/A'),
            'source': metadata['source'],
            'arxiv_id': metadata.get('arxiv_id', 'N/A'),
            'doi': metadata.get('doi', 'N/A'),
            'url': metadata['url'],
            'pdf_path': metadata.get('pdf_path', 'N/A'),
            'markdown_path': metadata.get('markdown_path', 'N/A'),
            'citation': citation,
            'categories': metadata.get('categories', []),
            'abstract': metadata.get('abstract', ''),
            'added_date': datetime.now().isoformat()
        }
        
        self.save()
        return paper_uuid, citation
    
    def generate_citation(self, metadata):
        """Generate APA-style citation."""
        authors = metadata['authors']
        if isinstance(authors, list):
            if len(authors) == 1:
                author_str = authors[0]
            elif len(authors) == 2:
                author_str = f"{authors[0]} & {authors[1]}"
            elif len(authors) > 2:
                author_str = f"{authors[0]} et al."
            else:
                author_str = "Unknown"
        else:
            author_str = authors
        
        year = metadata.get('year', 'n.d.')
        title = metadata['title']
        
        if metadata['source'] == 'arXiv':
            arxiv_id = metadata.get('arxiv_id', '')
            citation = f"{author_str} ({year}). {title}. arXiv preprint arXiv:{arxiv_id}."
        else:
            citation = f"{author_str} ({year}). {title}. {metadata['source']}."
        
        return citation
    
    def get_paper(self, paper_uuid):
        """Retrieve paper by UUID."""
        return self.papers.get(paper_uuid)
    
    def list_papers(self):
        """List all papers with UUIDs."""
        return self.papers
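
The author-count rules used by `generate_citation` can be checked in isolation. The sketch below restates them as two standalone helpers (`format_authors` and `format_citation` are illustrative names, and the sample record is made up). Note that the result is APA-like rather than strict APA, since author names are not reordered to "Surname, Initials".

```python
def format_authors(authors):
    """Collapse an author list to the short form used in the citation string."""
    if not isinstance(authors, list):
        return authors  # already a plain string
    if not authors:
        return "Unknown"
    if len(authors) == 1:
        return authors[0]
    if len(authors) == 2:
        return f"{authors[0]} & {authors[1]}"
    return f"{authors[0]} et al."


def format_citation(metadata):
    """Build the citation string from a metadata dict (same fields as the notebook)."""
    author_str = format_authors(metadata["authors"])
    year = metadata.get("year", "n.d.")
    if metadata["source"] == "arXiv":
        return (f"{author_str} ({year}). {metadata['title']}. "
                f"arXiv preprint arXiv:{metadata.get('arxiv_id', '')}.")
    return f"{author_str} ({year}). {metadata['title']}. {metadata['source']}."


# Hypothetical record, for illustration only
example = {
    "title": "Example Title",
    "authors": ["A. Author", "B. Author", "C. Author"],
    "year": 2025,
    "source": "arXiv",
    "arxiv_id": "0000.00000v1",
}
print(format_citation(example))
# A. Author et al. (2025). Example Title. arXiv preprint arXiv:0000.00000v1.
```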


In [7]:
# ---------------------------------------------------------
# EXTRACT COMPREHENSIVE ARXIV METADATA
# ---------------------------------------------------------
def extract_arxiv_metadata(paper):
    """Extract all available metadata from arXiv paper."""
    metadata = {
        'title': paper.title,
        'authors': [a.name for a in paper.authors],
        'abstract': paper.summary,
        'year': paper.published.year,
        'published': paper.published.strftime('%Y-%m-%d'),
        'updated': paper.updated.strftime('%Y-%m-%d') if paper.updated else 'N/A',
        'arxiv_id': paper.get_short_id(),
        'entry_id': paper.entry_id,
        'doi': paper.doi if paper.doi else 'N/A',
        'primary_category': paper.primary_category,
        'categories': paper.categories,
        'comment': paper.comment if paper.comment else 'N/A',
        'journal_ref': paper.journal_ref if paper.journal_ref else 'N/A',
        'pdf_url': paper.pdf_url,
        'url': paper.entry_id,
        'source': 'arXiv'
    }
    
    return metadata

In [8]:
# ---------------------------------------------------------
# PDF DOWNLOAD AND ORGANIZATION
# ---------------------------------------------------------
def download_and_save_pdf(paper, source_type, db):
    """Download PDF and save with organized naming."""
    os.makedirs(PDF_FOLDER, exist_ok=True)
    
    if source_type == '1':  # arXiv
        metadata = extract_arxiv_metadata(paper)
        safe_name = sanitize_filename(paper.title)
        pdf_filename = f"{metadata['arxiv_id']}_{safe_name}.pdf"
        pdf_path = os.path.join(PDF_FOLDER, pdf_filename)
        
        print(f"   ⬇️ Downloading: {paper.title[:50]}...")
        try:
            paper.download_pdf(filename=pdf_path)
            print(f"   ✅ PDF saved: {pdf_path}")
            metadata['pdf_path'] = pdf_path
            
            # Add to database
            paper_uuid, citation = db.add_paper(metadata)
            print(f"   🆔 UUID: {paper_uuid}")
            print(f"   📝 Citation: {citation[:80]}...")
            
            return True, paper_uuid
        except Exception as e:
            print(f"   ❌ Download failed: {e}")
            return False, None
            
    else:  # PubMed
        metadata = {
            'title': paper['title'],
            'authors': paper['authors'][:3] if paper['authors'] else ['Unknown'],
            'year': str(paper['date'])[:4] if paper['date'] else 'N/A',
            'source': 'PubMed Central',
            'url': f"https://www.ncbi.nlm.nih.gov/pmc/articles/{paper['pmc_id']}/",
            'pmc_id': paper['pmc_id'],
            'abstract': get_pubmed_abstract(paper['pmc_id'])
        }
        
        safe_name = sanitize_filename(paper['title'])
        pdf_filename = f"{paper['pmc_id']}_{safe_name}.pdf"
        pdf_path = os.path.join(PDF_FOLDER, pdf_filename)
        
        print(f"   ⬇️ Downloading: {paper['title'][:50]}...")
        
        if download_pubmed_pdf(paper['pmc_id'], pdf_path):
            print(f"   ✅ PDF saved: {pdf_path}")
            metadata['pdf_path'] = pdf_path
            
            # Add to database
            paper_uuid, citation = db.add_paper(metadata)
            print(f"   🆔 UUID: {paper_uuid}")
            print(f"   📝 Citation: {citation[:80]}...")
            
            return True, paper_uuid
        else:
            return False, None

def sanitize_filename(title):
    """Create safe filename from title."""
    safe = re.sub(r'[^\w\s-]', '', title)
    safe = re.sub(r'[-\s]+', '_', safe)
    return safe[:50]
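
To make the behaviour of `sanitize_filename` concrete, the sketch below repeats the function and runs it on a few made-up titles. Punctuation is dropped, runs of spaces and hyphens collapse to a single underscore, and the result is cut at 50 characters. Because of the cut, two long titles with the same first 50 characters would map to the same name, which is presumably why the notebook prefixes the arXiv or PMC id.

```python
import re


def sanitize_filename(title):
    """Create safe filename from title (same logic as the notebook cell)."""
    safe = re.sub(r'[^\w\s-]', '', title)   # drop everything except word chars, spaces, hyphens
    safe = re.sub(r'[-\s]+', '_', safe)     # collapse hyphen/whitespace runs into one underscore
    return safe[:50]                        # cap the length


# Hypothetical titles, for illustration only
print(sanitize_filename("LLM Agents: A Survey (v2)!"))    # LLM_Agents_A_Survey_v2
print(sanitize_filename("C-VARC: A Large-Scale Corpus"))  # C_VARC_A_Large_Scale_Corpus
print(len(sanitize_filename("x" * 80)))                   # 50
```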


In [None]:
# ---------------------------------------------------------
# MARKER AI INITIALIZATION
# ---------------------------------------------------------
print("⏳ Loading Marker AI models...")
try:
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered
    
    # Initialize converter (loads PyTorch models)
    converter = PdfConverter(
        artifact_dict=create_model_dict(),
    )
    print("✅ Marker models loaded.")
    MARKER_AVAILABLE = True
except ImportError:
    print("❌ Critical Error: Marker library outdated or missing.")
    print("Run: pip install marker-pdf --upgrade")
    MARKER_AVAILABLE = False
except Exception as e:
    print(f"❌ Model Load Error: {e}")
    MARKER_AVAILABLE = False


In [10]:


# ---------------------------------------------------------
# SELECTIVE MARKDOWN EXTRACTION WITH MARKER AI
# ---------------------------------------------------------
def extract_sections_from_pdf(pdf_path):
    """
    Extract only important sections using Marker AI: Abstract, Introduction, Conclusion.
    """
    if not MARKER_AVAILABLE:
        print("   ⚠️ Marker AI not available. Skipping extraction.")
        return None
    
    try:
        print("   🤖 Processing with Marker AI...")
        
        # Convert PDF to markdown using Marker AI
        rendered = converter(pdf_path)
        full_text, _, images = text_from_rendered(rendered)
        
        print(f"   📝 Extracted {len(full_text)} characters")
        
        # Find sections using regex patterns
        sections = {
            'abstract': extract_section(full_text, r'abstract', r'introduction|keywords|1\s+introduction'),
            'introduction': extract_section(full_text, r'introduction|1\s+introduction', r'related work|methodology|method|background|2\s+'),
            'conclusion': extract_section(full_text, r'conclusion|conclusions|discussion and conclusion', r'references|acknowledgment|appendix|bibliography')
        }
        
        # Clean up sections
        for key in sections:
            if sections[key]:
                # Remove excessive whitespace and newlines
                sections[key] = re.sub(r'\n\s*\n\s*\n+', '\n\n', sections[key])
                sections[key] = sections[key].strip()
        
        return sections
        
    except Exception as e:
        print(f"   ⚠️ Marker AI extraction failed: {e}")
        return None

def extract_section(text, start_pattern, end_pattern):
    """Extract text between section headers with improved patterns."""
    import re
    
    # Case-insensitive search; the alternation is grouped so that \b applies
    # to every alternative, not only to the first and last one
    start_match = re.search(r'\b(?:' + start_pattern + r')\b', text, re.IGNORECASE)
    if not start_match:
        # Try without word boundaries for numbered sections
        start_match = re.search(start_pattern, text, re.IGNORECASE)
        if not start_match:
            return ""
    
    start_pos = start_match.end()
    
    # Skip section number and title line
    # Find first paragraph after header
    lines = text[start_pos:].split('\n')
    actual_start = 0
    for i, line in enumerate(lines):
        if line.strip() and len(line.strip()) > 20:  # First substantial line
            actual_start = sum(len(l) + 1 for l in lines[:i])
            break
    
    start_pos += actual_start
    
    # Find end of section
    end_match = re.search(r'\b(?:' + end_pattern + r')\b', text[start_pos:], re.IGNORECASE)
    if not end_match:
        end_match = re.search(end_pattern, text[start_pos:], re.IGNORECASE)
    
    if end_match:
        end_pos = start_pos + end_match.start()
        section_text = text[start_pos:end_pos]
    else:
        # Take next 3000 characters if no end found
        section_text = text[start_pos:start_pos+3000]
    
    return section_text.strip()

def create_selective_markdown(paper_uuid, metadata, sections):
    """Create markdown with only important sections."""
    os.makedirs(MARKDOWN_FOLDER, exist_ok=True)
    
    title = metadata['title']
    safe_name = sanitize_filename(title)
    md_filename = f"{paper_uuid[:8]}_{safe_name}.md"
    md_path = os.path.join(MARKDOWN_FOLDER, md_filename)
    
    # Build markdown content
    content = f"""# {title}

**UUID:** `{paper_uuid}`

---

## 📋 Metadata

| Field | Value |
|-------|-------|
| **Authors** | {', '.join(metadata['authors']) if isinstance(metadata['authors'], list) else metadata['authors']} |
| **Year** | {metadata.get('year', 'N/A')} |
| **Source** | {metadata['source']} |
| **arXiv ID** | {metadata.get('arxiv_id', 'N/A')} |
| **DOI** | {metadata.get('doi', 'N/A')} |
| **Categories** | {', '.join(metadata.get('categories', [])) if isinstance(metadata.get('categories'), list) else 'N/A'} |

**🔗 URL:** [{metadata['url']}]({metadata['url']})

---

## 📖 Citation

```
{metadata['citation']}
```

---

## 📝 Abstract

{sections.get('abstract') or metadata.get('abstract') or 'Not available'}

---

## 🔍 Introduction

{sections.get('introduction') or '*Section not extracted or not found in document*'}

---

## 🎯 Conclusion

{sections.get('conclusion') or '*Section not extracted or not found in document*'}

---

## 📎 File Information

- **PDF Location:** `{metadata.get('pdf_path', 'N/A')}`
- **Extracted:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
- **Extraction Method:** Marker AI

---

*This markdown contains only key sections (Abstract, Introduction, Conclusion) for quick reference. See the full PDF for complete content including Methods, Results, Discussion, and References.*
"""
    
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write(content)
    
    return md_path
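
One subtlety in header matching is operator precedence. When `\b` is concatenated around an alternation such as `abstract|introduction`, the boundaries bind only to the first and last alternative unless the alternation is wrapped in a group. The sketch below is an illustration on a made-up string, not the notebook's exact code path; it shows the ungrouped pattern matching inside "abstraction" while the grouped pattern finds the real header.

```python
import re

alternatives = r'abstract|introduction'
text = "An abstraction layer. Abstract This paper studies section headers."

# Ungrouped: parsed as (\babstract) | (introduction\b)
loose = re.search(r'\b' + alternatives + r'\b', text, re.IGNORECASE)

# Grouped: \b applies on both sides of every alternative
strict = re.search(r'\b(?:' + alternatives + r')\b', text, re.IGNORECASE)

print(loose.group(0), loose.start())    # abstract 3  (inside "abstraction")
print(strict.group(0), strict.start())  # Abstract 22 (the standalone word)
```

Even with grouping, a word-level search can still hit an ordinary sentence that happens to contain "introduction" or "conclusion", so the extracted sections are best treated as heuristic.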


In [11]:

# ---------------------------------------------------------
# ARXIV HANDLERS
# ---------------------------------------------------------
def build_arxiv_query(topic, filters):
    """Build advanced arXiv query with filters."""
    query_parts = [topic]
    
    if filters.get('categories'):
        cat_query = ' OR '.join([f'cat:{cat}' for cat in filters['categories']])
        query_parts.append(f"({cat_query})")
    
    if filters.get('author'):
        query_parts.append(f'au:{filters["author"]}')
    
    if filters.get('title_contains'):
        query_parts.append(f'ti:{filters["title_contains"]}')
    
    return ' AND '.join(query_parts)

def get_arxiv(query, filters=None):
    """Get arXiv papers with advanced filtering options."""
    if filters is None:
        filters = {}
    
    search_query = build_arxiv_query(query, filters)
    
    sort_by = filters.get('sort_by', 'announced')
    if sort_by == 'announced':
        sort_criterion = arxiv.SortCriterion.LastUpdatedDate
    elif sort_by == 'submitted':
        sort_criterion = arxiv.SortCriterion.SubmittedDate
    else:
        sort_criterion = arxiv.SortCriterion.Relevance
    
    max_results = filters.get('max_results', MAX_RESULTS)
    
    client = arxiv.Client()
    search = arxiv.Search(
        query=search_query,
        max_results=max_results * 2,
        sort_by=sort_criterion,
        sort_order=arxiv.SortOrder.Descending
    )
    
    papers = list(client.results(search))
    
    if filters.get('date_from') or filters.get('date_to'):
        filtered_papers = []
        for paper in papers:
            paper_date = paper.published
            
            if filters.get('date_from') and paper_date < filters['date_from']:
                continue
            if filters.get('date_to') and paper_date > filters['date_to']:
                continue
                
            filtered_papers.append(paper)
        papers = filtered_papers
    
    return papers[:max_results]
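
The date filter compares `paper.published` with the bounds stored in `filters`. Python raises `TypeError` when a timezone-aware datetime is compared with a naive one, and, to the best of my knowledge, the `arxiv` package returns `published` as a timezone-aware UTC value. The sketch below normalises both bounds before comparing. `FakeResult` is a hypothetical stand-in for an arXiv result, and treating naive bounds as UTC is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FakeResult:
    """Hypothetical stand-in for an arXiv result with an aware `published` field."""
    title: str
    published: datetime


def as_utc(dt):
    """Return an aware UTC datetime; naive input is assumed to already be in UTC."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def filter_by_date(papers, date_from=None, date_to=None):
    """Keep papers whose publication time lies within the optional bounds."""
    lower, upper = as_utc(date_from), as_utc(date_to)
    kept = []
    for paper in papers:
        if lower is not None and paper.published < lower:
            continue
        if upper is not None and paper.published > upper:
            continue
        kept.append(paper)
    return kept


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
papers = [
    FakeResult("recent", now - timedelta(days=1)),
    FakeResult("old", now - timedelta(days=30)),
]
# A naive lower bound, as datetime.now() would produce
naive_cutoff = (now - timedelta(days=7)).replace(tzinfo=None)
print([p.title for p in filter_by_date(papers, date_from=naive_cutoff)])  # ['recent']
```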

In [12]:

# ---------------------------------------------------------
# PUBMED HANDLERS
# ---------------------------------------------------------
def get_pubmed(query):
    Entrez.email = NCBI_EMAIL
    Entrez.api_key = NCBI_API_KEY

    try:
        handle = Entrez.esearch(db="pmc", term=query, retmax=MAX_RESULTS)
        record = Entrez.read(handle)
        handle.close()

        ids = record.get("IdList", [])
        if not ids:
            return []

        handle = Entrez.esummary(db="pmc", id=",".join(ids))
        summaries = Entrez.read(handle)
        handle.close()

        papers = []
        for summary in summaries:
            papers.append({
                'pmc_id': f"PMC{summary.get('Id', '')}",
                'title': summary.get('Title', 'No title'),
                'authors': summary.get('AuthorList', []),
                'date': summary.get('PubDate', 'N/A'),
                'source': summary.get('Source', ''),
            })
        return papers

    except Exception as e:
        print(f"   ❌ PubMed API Error: {e}")
        return []

def get_pubmed_abstract(pmc_id):
    """Fetch abstract for a single PMC article."""
    try:
        handle = Entrez.efetch(db="pmc", id=pmc_id.replace("PMC", ""), rettype="xml")
        content = handle.read()
        handle.close()
        # Match <abstract> even when the opening tag carries attributes
        match = re.search(rb'<abstract[^>]*>(.*?)</abstract>', content, re.DOTALL)
        if match:
            abstract = match.group(1).decode('utf-8', errors='ignore')
            abstract = re.sub(r'<[^>]+>', '', abstract).strip()
            return abstract
    except Exception:
        # Network or parsing failure: fall through to the placeholder below
        pass
    return "Abstract not available"

def download_pubmed_pdf(pmc_id, pdf_path):
    """Download PDF from PubMed Central."""
    import tarfile
    import io
    
    oa_url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={pmc_id}"
    
    try:
        r = requests.get(oa_url, timeout=30)
        if r.status_code != 200:
            return False

        tgz_match = re.search(r'href="(ftp://[^"]+\.tar\.gz)"', r.text)
        if not tgz_match:
            return False

        tgz_url = tgz_match.group(1)
        tgz_url = tgz_url.replace("ftp://ftp.ncbi.nlm.nih.gov/", "https://ftp.ncbi.nlm.nih.gov/")

        r2 = requests.get(tgz_url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=120)
        if r2.status_code != 200:
            return False

        tar_bytes = io.BytesIO(r2.content)
        with tarfile.open(fileobj=tar_bytes, mode='r:gz') as tar:
            for member in tar.getmembers():
                if member.name.endswith('.pdf'):
                    pdf_file = tar.extractfile(member)
                    if pdf_file:
                        with open(pdf_path, 'wb') as f:
                            f.write(pdf_file.read())
                        return True
        return False

    except Exception as e:
        print(f"      Error: {e}")
        return False
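
The archive handling in `download_pubmed_pdf` can be exercised without any network access. The sketch below builds a small `.tar.gz` in memory and pulls the first PDF member out of it, using only the standard library. The helper names and the `PMC0000000` paths are made up for the demonstration.

```python
import io
import tarfile


def first_pdf_from_tar_gz(archive_bytes):
    """Return (member_name, pdf_bytes) for the first PDF in a gzipped tar, or None."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.lower().endswith(".pdf"):
                extracted = tar.extractfile(member)
                if extracted is not None:
                    return member.name, extracted.read()
    return None


def make_demo_archive():
    """Build a tiny in-memory archive with one text file and one fake PDF."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        files = [
            ("PMC0000000/notes.txt", b"hello"),
            ("PMC0000000/paper.pdf", b"%PDF-1.4 demo"),
        ]
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


name, data = first_pdf_from_tar_gz(make_demo_archive())
print(name, len(data))  # PMC0000000/paper.pdf 13
```

Reading the member through `extractfile` avoids writing archive paths to disk, so member names in the archive never influence where the PDF is saved.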

In [13]:

# ---------------------------------------------------------
# DISPLAY HELPERS
# ---------------------------------------------------------
def display_results(papers, source_type):
    print("\n" + "=" * 160)
    if source_type == '1':
        print(f"{'#':<4} | {'arXiv ID':<15} | {'Published':<12} | {'Updated':<12} | {'Categories':<20} | {'Title':<60}")
    else:
        print(f"{'#':<4} | {'PMC ID':<15} | {'Date':<12} | {'Source':<30} | {'Title':<60}")
    print("=" * 160)

    for i, p in enumerate(papers):
        if source_type == '1':
            arxiv_id = p.get_short_id()
            published = p.published.strftime('%Y-%m-%d')
            updated = p.updated.strftime('%Y-%m-%d') if p.updated else 'N/A'
            categories = ', '.join(p.categories[:2]) if len(p.categories) > 2 else ', '.join(p.categories)
            if len(categories) > 18:
                categories = categories[:15] + "..."
            title = p.title.replace('\n', ' ')
            if len(title) > 57:
                title = title[:54] + "..."
            
            print(f"{i+1:<4} | {arxiv_id:<15} | {published:<12} | {updated:<12} | {categories:<20} | {title:<60}")
        else:
            pmc_id = p['pmc_id']
            date = str(p['date'])[:12] if p['date'] else "N/A"
            source = p['source'][:28] if len(p['source']) > 28 else p['source']
            title = p['title'].replace('\n', ' ')
            if len(title) > 57:
                title = title[:54] + "..."
            
            print(f"{i+1:<4} | {pmc_id:<15} | {date:<12} | {source:<30} | {title:<60}")

    print("=" * 160 + "\n")

def get_arxiv_filters():
    """Interactive filter selection for arXiv."""
    filters = {}
    
    print("\n--- arXiv Filters (press Enter to skip) ---")
    
    print("\nSort by:")
    print("1. Latest Announcement (default)")
    print("2. Submission Date")
    print("3. Relevance")
    sort_choice = input("Choose (1-3): ").strip()
    if sort_choice == '2':
        filters['sort_by'] = 'submitted'
    elif sort_choice == '3':
        filters['sort_by'] = 'relevance'
    else:
        filters['sort_by'] = 'announced'
    
    print("\nAvailable Categories:")
    for i, (code, name) in enumerate(ARXIV_CATEGORIES.items(), 1):
        print(f"{i}. {code} - {name}")
    
    cat_input = input("\nEnter category numbers (comma-separated, e.g., 1,2): ").strip()
    if cat_input:
        try:
            indices = [int(x.strip()) - 1 for x in cat_input.split(',')]
            cat_list = list(ARXIV_CATEGORIES.keys())
            filters['categories'] = [cat_list[i] for i in indices if 0 <= i < len(cat_list)]
        except ValueError:
            print("Invalid input, skipping categories")
    
    date_input = input("\nLast N days (e.g., 7 for last week): ").strip()
    if date_input:
        try:
            days = int(date_input)
            # arXiv timestamps are timezone-aware, so build an aware cutoff as well
            filters['date_from'] = datetime.now().astimezone() - timedelta(days=days)
        except ValueError:
            print("Invalid input, skipping date filter")
    
    author = input("\nFilter by author name: ").strip()
    if author:
        filters['author'] = author
    
    return filters
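
The comma-separated number prompts (categories here, paper numbers in the main cell) discard the whole input when a single token is not a number. A more forgiving variant is sketched below. `parse_selection` is a hypothetical helper, and skipping bad tokens instead of rejecting the input is a design choice of this sketch rather than the notebook's behaviour.

```python
def parse_selection(raw, n_items):
    """Turn '1, 3,5' into zero-based indices, skipping blanks, junk, duplicates and out-of-range numbers."""
    indices = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            idx = int(token) - 1  # prompts are 1-based, lists are 0-based
        except ValueError:
            continue  # ignore non-numeric tokens
        if 0 <= idx < n_items and idx not in indices:
            indices.append(idx)
    return indices


print(parse_selection("1, 3, x, 99, 3", 10))  # [0, 2]
```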


In [14]:

# ---------------------------------------------------------
# MAIN EXECUTION
# ---------------------------------------------------------
if __name__ == "__main__":
    print("=" * 80)
    print("  📚 PAPER DOWNLOAD & EXTRACTION SYSTEM V6 (Marker AI)")
    print("=" * 80)
    print(f"  📁 PDFs saved to: {PDF_FOLDER}/")
    print(f"  📝 Markdown saved to: {MARKDOWN_FOLDER}/")
    print(f"  🗄️  Metadata database: {METADATA_FILE}")
    print("=" * 80)
    
    if not MARKER_AVAILABLE:
        print("\n⚠️  WARNING: Marker AI not available!")
        print("   PDF downloads will work, but extraction will be skipped.")
        print("   Install with: pip install marker-pdf --upgrade")
        continue_anyway = input("\n   Continue anyway? (y/n): ").lower().strip()
        if continue_anyway != 'y':
            exit()
    
    # Initialize database
    db = PaperDatabase()
    
    print("\n1. arXiv")
    print("2. PubMed (PMC Open Access)")

    choice = input("Select Source (1 or 2): ").strip()
    
    if choice not in ['1', '2']:
        print("❌ Invalid choice. Please select 1 or 2.")
        exit()
    
    topic = input("Enter search topic: ").strip()
    
    if not topic:
        print("❌ Search topic cannot be empty.")
        exit()
    
    filters = None
    if choice == '1':
        use_filters = input("Use advanced filters? (y/n): ").lower().strip()
        if use_filters == 'y':
            filters = get_arxiv_filters()

    print(f"\n🔍 Searching for '{topic}' (Max {MAX_RESULTS})...")

    if choice == '1':
        papers = get_arxiv(topic, filters)
    else:
        papers = get_pubmed(topic)

    if not papers:
        print("❌ No results found.")
        exit()

    display_results(papers, choice)

    print("\n📥 DOWNLOAD OPTIONS:")
    print("1. Download all PDFs only")
    print("2. Download all PDFs + Extract to Markdown")
    print("3. Select specific papers")
    print("4. Quit")
    
    mode = input("\nYour choice (1-4): ").strip()

    if mode == '1':
        print("\n🚀 Downloading PDFs only...")
        for paper in papers:
            download_and_save_pdf(paper, choice, db)
            time.sleep(DOWNLOAD_DELAY)
        print(f"\n✅ Complete! Check {PDF_FOLDER}/ and {METADATA_FILE}")
        
    elif mode == '2':
        print("\n🚀 Downloading PDFs and extracting to Markdown...")
        for paper in papers:
            success, paper_uuid = download_and_save_pdf(paper, choice, db)
            if success and paper_uuid:
                # Get metadata
                paper_data = db.get_paper(paper_uuid)
                pdf_path = paper_data['pdf_path']
                
                print("   📄 Extracting sections from PDF...")
                sections = extract_sections_from_pdf(pdf_path)
                
                if sections:
                    md_path = create_selective_markdown(paper_uuid, paper_data, sections)
                    print(f"   ✅ Markdown saved: {md_path}")
                    
                    # Update database with markdown path
                    paper_data['markdown_path'] = md_path
                    db.save()
                    
            time.sleep(DOWNLOAD_DELAY)
        print(f"\n✅ Complete! Check {PDF_FOLDER}/, {MARKDOWN_FOLDER}/ and {METADATA_FILE}")
        
    elif mode == '3':
        indices = input("Enter paper numbers (comma-separated, e.g., 1,3,5): ").strip()
        try:
            selected = [int(x.strip()) - 1 for x in indices.split(',')]
            extract = input("Extract to markdown? (y/n): ").lower().strip() == 'y'
            
            for idx in selected:
                if 0 <= idx < len(papers):
                    success, paper_uuid = download_and_save_pdf(papers[idx], choice, db)
                    
                    if success and extract and paper_uuid:
                        paper_data = db.get_paper(paper_uuid)
                        pdf_path = paper_data['pdf_path']
                        
                        print("   📄 Extracting sections...")
                        sections = extract_sections_from_pdf(pdf_path)
                        
                        if sections:
                            md_path = create_selective_markdown(paper_uuid, paper_data, sections)
                            print(f"   ✅ Markdown saved: {md_path}")
                            paper_data['markdown_path'] = md_path
                            db.save()
                    
                    time.sleep(DOWNLOAD_DELAY)
        except ValueError:
            print("Invalid input")
            
    else:
        print("Exiting...")
    
    print("\n" + "=" * 80)
    print(f"📊 Total papers in database: {len(db.papers)}")
    print("=" * 80)

  📚 PAPER DOWNLOAD & EXTRACTION SYSTEM V6 (Marker AI)
  📁 PDFs saved to: papers_pdf/
  📝 Markdown saved to: papers_markdown/
  🗄️  Metadata database: papers_metadata.json

1. arXiv
2. PubMed (PMC Open Access)

🔍 Searching for 'llm' (Max 50)...

#    | arXiv ID        | Published    | Updated      | Categories           | Title                                                       
1    | 2512.23090v2    | 2025-12-28   | 2026-01-02   | cs.AI, cs.LG         | Benchmark Success, Clinical Failure: When Reinforcemen...   
2    | 2501.19107v3    | 2025-01-31   | 2026-01-02   | cs.LG                | Brain network science modelling of sparse neural netwo...   
3    | 2601.00770v1    | 2026-01-02   | 2026-01-02   | cs.CE, cs.AI         | LLM Agents for Combinatorial Efficient Frontiers: Inve...   
4    | 2506.01495v5    | 2025-06-02   | 2026-01-02   | cs.CL                | C-VARC: A Large-Scale Chinese Value Rule Corpus for Va...   
5    | 2601.00756v1    | 2026-01-02   | 202

Recognizing Layout: 100%|██████████| 21/21 [01:51<00:00,  5.31s/it]
Running OCR Error Detection: 100%|██████████| 6/6 [00:01<00:00,  4.28it/s]
Detecting bboxes: 100%|██████████| 1/1 [00:02<00:00,  2.28s/it]
Recognizing Text: 100%|██████████| 14/14 [05:51<00:00, 25.10s/it]
Recognizing Text: 100%|██████████| 4/4 [00:37<00:00,  9.38s/it]
Recognizing tables: 100%|██████████| 1/1 [00:06<00:00,  6.40s/it]
Detecting bboxes: 100%|██████████| 1/1 [00:02<00:00,  2.31s/it]
Recognizing Text: 100%|██████████| 167/167 [01:18<00:00,  2.14it/s]


   📝 Extracted 57884 characters
   ✅ Markdown saved: papers_markdown\8d2f27af_Benchmark_Success_Clinical_Failure_When_Reinforcem.md
   ⬇️ Downloading: Brain network science modelling of sparse neural n...
   ✅ PDF saved: papers_pdf\2501.19107v3_Brain_network_science_modelling_of_sparse_neural_n.pdf
   🆔 UUID: e2b2a0f7-26ab-4625-9a93-931eca5fe1b5
   📝 Citation: Yingtao Zhang et al. (2025). Brain network science modelling of sparse neural ne...
   📄 Extracting sections from PDF...
   🤖 Processing with Marker AI...


Recognizing Layout: 100%|██████████| 37/37 [05:34<00:00,  9.04s/it]
Running OCR Error Detection: 100%|██████████| 10/10 [00:03<00:00,  2.64it/s]
Detecting bboxes: 100%|██████████| 2/2 [00:08<00:00,  4.27s/it]
Recognizing Text:  91%|█████████▏| 147/161 [09:42<00:03,  3.57it/s]

KeyboardInterrupt: 