# arXiv QML Curation Demo

This notebook demonstrates how to use the `arxivqml` package to search arXiv for quantum machine learning (QML) papers, score them with an LLM, and store the results in MongoDB.

## 1. Imports

Import the main job runner and the database query functions.

In [1]:
from arxivqml.main import run_arxiv_search_job
from arxivqml.database import get_db_collection, get_top_papers

## 2. Run the Full Curation Job

`run_arxiv_search_job()` connects to the database and the LLM, searches for new papers across all configured categories, curates them, and saves the results to MongoDB.

**Note:** Make sure your `.env` file is configured with `MONGO_URI` and `GEMINI_API_KEY`.
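
A quick way to catch a missing setting before starting the job is to check the environment first. The snippet below is a minimal sketch, not part of `arxivqml`: `missing_env_vars` is a hypothetical helper, and it only inspects the process environment, so it assumes the `.env` file has already been loaded (for example with python-dotenv's `load_dotenv()`). Whether `arxivqml` loads the file for you is not shown in this notebook.

```python
import os

# Variable names taken from the note above.
REQUIRED_VARS = ("MONGO_URI", "GEMINI_API_KEY")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing configuration: {', '.join(missing)}")
else:
    print("All required variables are set.")
```

Passing a plain dictionary as `environ` makes the check easy to exercise without touching the real environment.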

In [None]:
# Uncomment the line below to run the full job
# run_arxiv_search_job()

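### 2.a) Full Curation Steps

The cells below walk through the steps of the curation job one at a time: initialize the connections, search arXiv, score the papers with the LLM, and insert the results into the database.
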
In [2]:
from datetime import datetime
from arxivqml import config
from arxivqml import database
from arxivqml import arxiv_search
from arxivqml import curation

print(f"\n--- Starting new arXiv search job at {datetime.now()} ---")
# 1. Initialize connections
collection = database.get_db_collection()
llm = curation.get_llm()


--- Starting new arXiv search job at 2025-09-30 23:12:28.802410 ---
✓ Successfully connected to MongoDB: arxiv_research.qml_papers
✓ LLM initialized: models/gemini-2.0-flash-lite


E0000 00:00:1759288348.832886 4116830 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


In [3]:
category = config.CATEGORIES[0]

# Step 2a: Search arXiv for new papers
new_papers = arxiv_search.search_arxiv(
    category=category, 
    query=config.QUERY_STRING, 
    collection=collection
)

# Step 2b: Curate and score papers with LLM
curated_papers = curation.curate_papers(
    papers=new_papers, 
    guidance_context=config.GUIDANCE_CONTEXT, 
    llm=llm
)


Executing arXiv search in 'quant-ph' for query: 'ti:"Quantum Machine Learning" OR abs:"Quantum Machine Learni...'
Found 25 new papers.
Scored 'Quantum Annealing for Minimum Bisection Problem: A Machine L...' -> 7/10
Scored 'QUBO-based training for VQAs on Quantum Annealers...' -> 7/10
Scored 'Investigation of D-Wave quantum annealing for training Restr...' -> 6/10
Scored 'Comparison of D-Wave Quantum Annealing and Markov Chain Mont...' -> 6/10
Scored 'Minor Embedding for Quantum Annealing with Reinforcement Lea...' -> 7/10
Scored 'Quantum Annealing for Machine Learning: Applications in Feat...' -> 7/10
Scored 'Quantum Annealing Algorithms for Estimating Ising Partition ...' -> 6/10
Scored 'A quantum annealing approach to graph node embedding...' -> 8/10
Scored 'Hyperspectral image segmentation with a machine learning mod...' -> 7/10
Scored 'Quantum Annealing Feature Selection on Light-weight Medical ...' -> 7/10
Scored 'Black-box optimization and quantum annealing for filtering o...' -
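
The scores printed above can be used to decide which papers are worth keeping. The following is a minimal sketch of such a filter, assuming each curated paper is a dictionary with a `relevance_score` field (the field name is taken from the query cell in section 3; the actual structure returned by `curate_papers` is not shown here). `filter_by_score`, the threshold of 7, and the sample titles are made up for illustration.

```python
def filter_by_score(papers, min_score=7):
    """Keep papers whose relevance score is at least `min_score`.

    Hypothetical helper: papers without a numeric score are dropped.
    """
    kept = []
    for paper in papers:
        score = paper.get("relevance_score")
        if isinstance(score, (int, float)) and score >= min_score:
            kept.append(paper)
    return kept


# Placeholder records, not real search results.
sample = [
    {"title": "Paper A", "relevance_score": 8},
    {"title": "Paper B", "relevance_score": 6},
    {"title": "Paper C"},  # unscored
]
print([p["title"] for p in filter_by_score(sample)])
```

With the notebook's variables, a call such as `filter_by_score(curated_papers, min_score=7)` could be applied before the insert step, provided the assumption about the record structure holds.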

In [None]:

# Step 2c: Insert curated papers into the database
if curated_papers:
    database.insert_papers(collection, curated_papers)

In [None]:
# 2. Loop through categories and process papers
for category in config.CATEGORIES:
    print(f"\n--- Processing category: {category} ---")

    # Step 2a: Search arXiv for new papers
    new_papers = arxiv_search.search_arxiv(
        category=category, 
        query=config.QUERY_STRING, 
        collection=collection
    )

    if not new_papers:
        print(f"No new papers found for category '{category}'.")
        continue

    # Step 2b: Curate and score papers with LLM
    curated_papers = curation.curate_papers(
        papers=new_papers, 
        guidance_context=config.GUIDANCE_CONTEXT, 
        llm=llm
    )

    # Step 2c: Insert curated papers into the database
    if curated_papers:
        database.insert_papers(collection, curated_papers)

print(f"\n--- Job finished at {datetime.now()} ---")
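
In the loop above, an exception raised while processing one category would stop the whole job. One possible variation, sketched below, isolates failures per category. `process_categories` and `flaky_search` are hypothetical names introduced for this example, the three callables stand in for `search_arxiv`, `curate_papers`, and `insert_papers`, and `cs.LG` is used only as an illustrative second category.

```python
def process_categories(categories, search, curate, insert):
    """Run search -> curate -> insert per category, isolating failures.

    Returns a dict mapping each category to the number of curated papers,
    or None if that category raised an exception.
    """
    results = {}
    for category in categories:
        try:
            new_papers = search(category)
            if not new_papers:
                results[category] = 0
                continue
            curated = curate(new_papers)
            if curated:
                insert(curated)
            results[category] = len(curated or [])
        except Exception as exc:
            # Report the failure and move on to the next category.
            print(f"Category '{category}' failed: {exc}")
            results[category] = None
    return results


def flaky_search(category):
    """Stub search that fails for one category to simulate an error."""
    if category == "cs.LG":
        raise RuntimeError("simulated network error")
    return [{"title": f"stub paper in {category}"}]


stored = []
summary = process_categories(
    ["quant-ph", "cs.LG"],
    flaky_search,
    curate=lambda papers: papers,  # stub: keep every paper
    insert=stored.extend,          # stub: collect in a list
)
print(summary)
```

To use it with the real functions, each callable can be a small wrapper, for example `search=lambda c: arxiv_search.search_arxiv(category=c, query=config.QUERY_STRING, collection=collection)`.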



## 3. Query Existing Papers from the Database

You can also query the results stored in your database directly. The cell below fetches the five papers with the highest relevance score and prints a short summary of each.

In [None]:
# 1. Get the database collection
collection = get_db_collection()

if collection is not None:
    # 2. Query the top 5 papers by relevance score
    top_papers = get_top_papers(collection, limit=5)
    total_papers = collection.count_documents({})
    
    print(f"Total papers in database: {total_papers}")
    print("--- Top Papers ---")
    
    # 3. Display the results
    for i, paper in enumerate(top_papers, 1):
        print(f"{i}. {paper['title']}")
        print(f"  Score: {paper.get('relevance_score', 'N/A')}/10")
        print(f"  Keywords: {', '.join(paper.get('keywords', []))}")
        print(f"  PDF: {paper['pdf_url']}")
        print("---")
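
The display loop above indexes `title` and `pdf_url` directly, which raises a `KeyError` if a stored document lacks either field. A more defensive variant is sketched below. `format_paper` is a hypothetical helper, the fallback strings are arbitrary choices, and the sample record is invented.

```python
def format_paper(paper, rank):
    """Render one paper document as a multi-line summary string.

    Every field is read with .get so a missing key falls back to a
    placeholder instead of raising KeyError.
    """
    keywords = paper.get("keywords") or []
    lines = [
        f"{rank}. {paper.get('title', '(untitled)')}",
        f"  Score: {paper.get('relevance_score', 'N/A')}/10",
        f"  Keywords: {', '.join(keywords)}",
        f"  PDF: {paper.get('pdf_url', 'N/A')}",
    ]
    return "\n".join(lines)


# Invented record with some fields missing.
print(format_paper({"title": "Example title", "relevance_score": 8}, 1))
```

Inside the loop, `print(format_paper(paper, i))` would replace the four field-level `print` calls.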