# ScienceSage Sanity Check Notebook

This notebook verifies that:

1. Text chunks exist
2. Qdrant embeddings were created
3. Retrieval returns reasonable results

You can run this before starting the Streamlit app.

In [1]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import json
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny, MatchValue
from sciencesage.config import OPENAI_API_KEY, QDRANT_URL, QDRANT_COLLECTION
from collections import Counter

[32m2025-09-14 22:21:12.120[0m | [1mINFO    [0m | [36msciencesage.config[0m:[36m<module>[0m:[36m122[0m - [1mConfiguration loaded.[0m


In [2]:
print(sys.executable)

/workspaces/ScienceSage/.venv/bin/python


In [3]:
client = OpenAI(api_key=OPENAI_API_KEY)
qdrant = QdrantClient(url=QDRANT_URL)

### 1️⃣ Check processed chunks

In [4]:
chunks_file = Path('../data/chunks/chunks.jsonl')
chunks = [json.loads(line) for line in chunks_file.open('r', encoding='utf-8')]
print(f"Total chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])

Total chunks: 1250
Sample chunk:
{'id': 'wikipedia_black_hole_0_0_8baf7250effb', 'uuid': '4fe1a8d3-90c2-50f0-8b43-fc68f8a98dc5', 'topics': ['Space', 'AI'], 'topic': 'Space', 'title': 'wikipedia_black_hole', 'url': None, 'image_url': None, 'images': [], 'matched_keywords': ['space', 'universe', 'galaxy', 'star', 'planet', 'black hole', 'NASA', 'solar system', 'earth', 'moon', 'saturn', 'gravity', 'ai', 'emissions', 'environment'], 'source': 'wikipedia_black_hole', 'chunk_index': 0, 'text': 'A black hole is an astronomical body so dense that its gravity prevents anything from escaping, even light. Albert Einstein\'s theory of general relativity predicts that a sufficiently compact mass will form a black hole.[4] The boundary of no escape is called the event horizon. In general relativity, a black hole\'s event horizon seals an object\'s fate but produces no locally detectable change when crossed.[5] In many ways, a black hole acts like an ideal black body, as it reflects no light.[6][7] 

### 2️⃣ Test embedding a query

In [13]:
query = "What is the difference between dark matter and dark energy?"
embedding_model = "text-embedding-3-small"

response = client.embeddings.create(model=embedding_model, input=query)
query_vector = response.data[0].embedding
print(f"Embedding length: {len(query_vector)}")

Embedding length: 1536


### 3️⃣ Test Qdrant similarity search

In [14]:
results = qdrant.query_points(
    collection_name=QDRANT_COLLECTION,
    query=query_vector,
    limit=3,
    query_filter=Filter(
        must=[FieldCondition(key="topics", match=MatchValue(value="dark matter"))]
    ),
)

print("Top 3 retrieved chunks:")
if hasattr(results, "points") and results.points:
    for p in results.points:
        text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
        print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
else:
    print("No chunks retrieved. Check your Qdrant collection and query.")

Top 3 retrieved chunks:
No chunks retrieved. Check your Qdrant collection and query.


In [15]:
client = QdrantClient("http://localhost:6333")
print(client.get_collections())

collections=[CollectionDescription(name='scientific_concepts')]


In [16]:
client.count("scientific_concepts", exact=True)

CountResult(count=1250)

In [17]:
all_topics = []
scroll_limit = 100  # adjust as needed for large collections
offset = None

while True:
    scroll_result = qdrant.scroll(
        collection_name=QDRANT_COLLECTION,
        limit=scroll_limit,
        offset=offset,
        with_payload=True,
        with_vectors=False,
    )
    points = scroll_result[0]
    if not points:
        break
    for p in points:
        # Handle both 'topic' (string) and 'topics' (list)
        if "topics" in p.payload and isinstance(p.payload["topics"], list):
            all_topics.extend(p.payload["topics"])
        elif "topic" in p.payload and isinstance(p.payload["topic"], str):
            all_topics.append(p.payload["topic"])
    if len(points) < scroll_limit:
        break
    offset = points[-1].id  # continue from last point

unique_topics = sorted(set(all_topics))
print(f"Unique topics in collection ({len(unique_topics)}):")
for t in unique_topics:
    print("-", t)
print("\nTopic counts:", Counter(all_topics))

Unique topics in collection (4):
- AI
- Climate
- Other
- Space

Topic counts: Counter({'AI': 1248, 'Space': 966, 'Climate': 547, 'Other': 5})


In [18]:
for topic in unique_topics:
    print(f"\n=== Top 3 retrieved chunks for topic: {topic} ===")
    results = qdrant.query_points(
        collection_name=QDRANT_COLLECTION,
        query=query_vector,
        limit=3,
        query_filter=Filter(
            must=[FieldCondition(key="topics", match=MatchValue(value=topic))]
        ),
    )
    if hasattr(results, "points") and results.points:
        for p in results.points:
            text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
            print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
    else:
        print("No chunks retrieved. Check your Qdrant collection and query.")


=== Top 3 retrieved chunks for topic: AI ===
ID: 13b0cd21-d47e-5b7a-8560-6f8ee4833fb5, text snippet: similar to dark energy, resulted in an enormous and exponential expansion of the universe during its earliest stages. Such expansion is an essential f...
ID: fdd9c98e-f031-5853-a5df-5712fb0d690b, text snippet: In physical cosmology and astronomy, dark energy is a proposed form of energy that affects the universe on the largest scales. Its primary effect is t...
ID: 478c7656-cc80-59b1-b075-fe1fb657d360, text snippet: ordinary matter.[22]: III.A Further indications of mass-to-light ratio anomalies came from measurements of galaxy rotation curves. In 1939, H.W. Babco...

=== Top 3 retrieved chunks for topic: Climate ===
ID: f4d73bf0-447e-534f-92c1-d6e681542b45, text snippet: units of magnitudes per square arcsecond (mag/arcsec2; sometimes expressed as mag arcsec−2), which defines the brightness depth of the isophote. To il...
ID: 092ee698-c703-5ab7-bc1d-2369004e85ed, text snippet: Way, as