<a href="https://colab.research.google.com/github/lekejo/lekejo/blob/main/zoterorag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pyzotero chromadb thepipe-api openai tqdm

In [None]:
from pyzotero import zotero
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_page
from openai import OpenAI
from tqdm import tqdm
import chromadb
import time
import os

# Set up environment variables
os.environ["ZOTERO_USER_ID"] = "..."
os.environ["ZOTERO_API_KEY"] = "..."
os.environ["THEPIPE_API_KEY"] = "..."
os.environ["LLM_SERVER_API_KEY"] = "..."
os.environ["LLM_SERVER_BASE_URL"] = "..."
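
The `"..."` placeholders above have to be replaced with real values before the later cells can work. As a minimal sketch, a small check can report which variables are unset or still placeholders. The helper name `missing_env_vars` is hypothetical and not part of any library used in this notebook.

```python
import os

# Names taken from the setup cell above
REQUIRED_VARS = [
    "ZOTERO_USER_ID",
    "ZOTERO_API_KEY",
    "THEPIPE_API_KEY",
    "LLM_SERVER_API_KEY",
    "LLM_SERVER_BASE_URL",
]

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset, empty, or still the "..." placeholder."""
    env = os.environ if environ is None else environ
    return [name for name in names if env.get(name, "").strip() in ("", "...")]

missing = missing_env_vars()
if missing:
    print("Set these variables before continuing:", ", ".join(missing))
```

Passing a plain dict as `environ` makes the check easy to try without touching the real environment.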

In [None]:
# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="chromadb")
collection = chroma_client.get_or_create_collection(name="zotero_papers")

# Initialize LLM client
llm_client = OpenAI(
    base_url=os.environ["LLM_SERVER_BASE_URL"],
    api_key=os.environ["LLM_SERVER_API_KEY"],
)

# Initialize Zotero client for user (use group id and "group" for group libraries)
zot = zotero.Zotero(
    library_id=os.environ.get("ZOTERO_USER_ID"),
    library_type="user",
    api_key=os.environ.get("ZOTERO_API_KEY")
)

In [None]:
# Create 'pdfs' directory if it doesn't exist
os.makedirs("pdfs", exist_ok=True)

# Retrieve all top-level items (PDFs stored as child attachments are not returned by zot.top())
items = zot.everything(zot.top())

for item in tqdm(items):
    if item['data'].get('contentType') == 'application/pdf':
        item_key = item['data']['key']
        filename = item['data'].get('filename')

        # Skip attachments with no filename or a non-.pdf extension
        if not filename or not filename.lower().endswith('.pdf'):
            continue

        file_path = os.path.join("pdfs", filename)

        # Download the file
        with open(file_path, 'wb') as f:
            f.write(zot.file(item_key))
        print(f"Downloaded: {filename}")

        # Scrape the file
        chunks = scrape_file(file_path, ai_extraction=True, text_only=True, local=True, chunking_method=chunk_by_page)
        print(f"Scraped {len(chunks)} chunks from {filename}")

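        # Add chunks to collection. IDs built from the Zotero key and chunk index
        # stay stable across runs, so re-running this cell updates existing entries
        # instead of storing duplicates.
        for i, chunk in enumerate(chunks):
            chunk_text = '\n'.join(chunk.texts)
            collection.upsert(
                documents=[chunk_text],
                metadatas=[{"source": chunk.path}],
                ids=[f"{item_key}-{i}"],
            )
        print(f"Added {len(chunks)} chunks to collection")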

100%|██████████| 4/4 [01:59<00:00, 29.82s/it]

Downloaded: Finnerty_2024_AJ_167_43.pdf
Scraped 13 chunks from Finnerty_2024_AJ_167_43.pdf
Added 13 chunks to collection
Downloaded: temp6172320900388794792.pdf
Scraped 4 chunks from temp6172320900388794792.pdf
Added 4 chunks to collection
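
The download loop only sees attachments that are themselves top-level items, because `zot.top()` does not return child items. In many libraries the PDF is a child of a paper entry instead. A hedged sketch of one way to reach those as well, using pyzotero's `children()` call, follows. The helper name `iter_pdf_attachments` is hypothetical, and each non-attachment parent costs one extra API request.

```python
def iter_pdf_attachments(zot, items):
    """Yield (item_key, filename) for every PDF attachment reachable from `items`.

    `zot` is assumed to be a pyzotero Zotero client; `items` is a list of
    top-level items such as the result of zot.everything(zot.top()).
    """
    for item in items:
        data = item["data"]
        if data.get("itemType") == "attachment":
            # Standalone attachment stored at the top level
            candidates = [item]
        else:
            # Regular entry: look at its child items (attachments and notes)
            candidates = zot.children(data["key"])

        for attachment in candidates:
            att_data = attachment["data"]
            if att_data.get("contentType") != "application/pdf":
                continue
            filename = att_data.get("filename")
            if filename:
                yield att_data["key"], filename
```

Each yielded key can be passed to `zot.file(...)` in the same way as `item_key` in the loop above.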





In [None]:
# Example query for retrieval-augmented generation
query = "Which figure shows retrieved P−T profile, maximum-likelihood spectra, and opacities? And for which chemicals does it show this data?"

# Query the collection
results = collection.query(
    query_texts=[query],
    n_results=3  # Retrieve top 3 most relevant chunks
)

# Prepare context from retrieved chunks
# Plain version without source attribution:
# context = "\n".join(results['documents'][0])

# Version with cited sources: wrap each chunk in a tag carrying its source path
context = ""
for source, text in zip(results['metadatas'][0], results['documents'][0]):
    context += f"<Document source='{source['source']}'>\n{text}\n</Document>\n"

print("Retrieved context to use for LLM generation:")
print(context)

Retrieved context to use for LLM generation:
<Document source='pdfs\Finnerty_2024_AJ_167_43.pdf'>
# The Astronomical Journal, 167:43 (13pp), 2024 January

## Figure 3

### Retrieved P−T Profile
- **Top Left**: Retrieved P−T profile
- **Top Right**: Maximum-likelihood emission contribution function
- **Middle**: Maximum-likelihood planet spectrum
- **Bottom**: Opacities for H₂O, CO, NH₃, and CH₄

The observed NIRSPEC orders are shaded in gray. In addition to the maximum-likelihood and median P−T profiles, the top left also includes the corresponding cloud-top pressures as dashed horizontal lines and the P−T profiles from 100 draws from the retrieved posterior. 

While several parameters are poorly constrained in the corner plots, the actual P−T profiles follow a tight distribution. The emission contribution function shows the emission mostly arises near 100 mbar, just above the cloud deck, with contribution from higher altitudes in the CO line cores. 

The dashed blue line plotted with 
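
The query always returns the top `n_results` chunks, even when some of them are only loosely related to the question. As a sketch under stated assumptions, the context can be built by a function that also drops chunks whose distance exceeds a cutoff. The helper name `build_context` and the `max_distance` parameter are hypothetical, and a sensible cutoff depends on the embedding function and distance metric of the collection.

```python
def build_context(results, max_distance=None):
    """Format query results as <Document> blocks, optionally filtering by distance.

    `results` is assumed to have the shape returned by collection.query for a
    single query text: lists nested one level under "metadatas", "documents",
    and (when present) "distances".
    """
    metadatas = results["metadatas"][0]
    documents = results["documents"][0]
    # Fall back to "no distance known" if distances were not included
    distances = (results.get("distances") or [[None] * len(documents)])[0]

    parts = []
    for meta, text, dist in zip(metadatas, documents, distances):
        if max_distance is not None and dist is not None and dist > max_distance:
            continue  # too far from the query to be useful context
        parts.append(f"<Document source='{meta['source']}'>\n{text}\n</Document>\n")
    return "".join(parts)
```

With `max_distance=None` the output matches the string built by the loop in the cell above.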

In [None]:
# Prepare chat messages: a system instruction plus the retrieved context and the query
messages = [
    {"role": "system", "content": "You are a helpful scientific assistant. Use the provided context to answer the user's question."},
    {"role": "user", "content": f"Context:\n{context}\nUser query: {query}"}
]

# Call the OpenAI-compatible server configured above (the model ID below is an OpenRouter name)
response = llm_client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=messages,
    temperature=0.2
)

# Get text from response
response_text = response.choices[0].message.content
print("LLM generation:", response_text)

LLM generation: 

Figure 3 shows the retrieved P−T profile, maximum-likelihood emission contribution function, maximum-likelihood planet spectrum, and opacities for H2O, CO, NH3, and CH4.
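
The retrieval and generation cells can be combined into one reusable function for asking further questions. This is a minimal sketch that repeats the same calls used above. The function name `answer_with_context` and its parameters are hypothetical, and it assumes `collection` and `llm_client` behave like the ChromaDB collection and OpenAI client created earlier.

```python
def answer_with_context(query, collection, llm_client, model, n_results=3):
    """Retrieve the most relevant chunks for `query` and ask the LLM to answer from them."""
    results = collection.query(query_texts=[query], n_results=n_results)

    # Wrap each retrieved chunk in a tag that carries its source path
    context = ""
    for source, text in zip(results["metadatas"][0], results["documents"][0]):
        context += f"<Document source='{source['source']}'>\n{text}\n</Document>\n"

    messages = [
        {
            "role": "system",
            "content": "You are a helpful scientific assistant. "
                       "Use the provided context to answer the user's question.",
        },
        {"role": "user", "content": f"Context:\n{context}\nUser query: {query}"},
    ]

    response = llm_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content
```

A call would then look like `answer_with_context(query, collection, llm_client, "meta-llama/llama-3.1-405b-instruct")`.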
