# ScienceSage Sanity Check Notebook

This notebook verifies that:

1. Text chunks exist
2. Qdrant embeddings were created
3. Retrieval returns reasonable results

You can run this before starting the Streamlit app.

In [1]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import json
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny, MatchValue
from sciencesage.config import OPENAI_API_KEY, QDRANT_URL, QDRANT_COLLECTION
from collections import Counter

[32m2025-09-20 22:27:54.429[0m | [1mINFO    [0m | [36msciencesage.config[0m:[36m<module>[0m:[36m131[0m - [1mConfiguration loaded.[0m


In [2]:
print(sys.executable)

/workspaces/ScienceSage/.venv/bin/python


In [3]:
client = OpenAI(api_key=OPENAI_API_KEY)
qdrant = QdrantClient(url=QDRANT_URL)

### 1️⃣ Check processed chunks

In [4]:
chunks_file = Path('../data/chunks/chunks.jsonl')
chunks = [json.loads(line) for line in chunks_file.open('r', encoding='utf-8')]
print(f"Total chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])

Total chunks: 1100
Sample chunk:
{'id': 'wikipedia_black_hole_0_0_0b36e8aa3de3', 'uuid': 'c6080002-814d-5433-8852-0cf14bd20586', 'text': 'A black hole is an astronomical body so dense that its gravity prevents anything from escaping, even light. Albert Einstein\'s theory of general relativity predicts that a sufficiently compact mass will form a black hole. The boundary of no escape is called the event horizon. In general relativity, a black hole\'s event horizon seals an object\'s fate but produces no locally detectable change when crossed. In many ways, a black hole acts like an ideal black body, as it reflects no light. Quantum field theory in curved spacetime predicts that event horizons emit Hawking radiation, with the same spectrum as a black body of a temperature inversely proportional to its mass. This temperature is of the order of billionths of a kelvin for stellar black holes, making it essentially impossible to observe directly. Objects whose gravitational fields are too st

### 2️⃣ Test embedding a query

In [5]:
query = "What is the difference between dark matter and dark energy?"
embedding_model = "text-embedding-3-small"

response = client.embeddings.create(model=embedding_model, input=query)
query_vector = response.data[0].embedding
print(f"Embedding length: {len(query_vector)}")

Embedding length: 1536


### 3️⃣ Test Qdrant similarity search

In [6]:
results = qdrant.query_points(
    collection_name=QDRANT_COLLECTION,
    query=query_vector,
    limit=3,
    query_filter=Filter(
        must=[FieldCondition(key="topics", match=MatchValue(value="dark matter"))]
    ),
)

print("Top 3 retrieved chunks:")
if hasattr(results, "points") and results.points:
    for p in results.points:
        text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
        print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
else:
    print("No chunks retrieved. Check your Qdrant collection and query.")

Top 3 retrieved chunks:
No chunks retrieved. Check your Qdrant collection and query.


In [7]:
client = QdrantClient("http://localhost:6333")
print(client.get_collections())

collections=[CollectionDescription(name='scientific_concepts')]


In [8]:
client.count("scientific_concepts", exact=True)

CountResult(count=1100)

In [9]:
all_topics = []
scroll_limit = 100  # adjust as needed for large collections
offset = None

while True:
    scroll_result = qdrant.scroll(
        collection_name=QDRANT_COLLECTION,
        limit=scroll_limit,
        offset=offset,
        with_payload=True,
        with_vectors=False,
    )
    points = scroll_result[0]
    if not points:
        break
    for p in points:
        # Handle both 'topic' (string) and 'topics' (list)
        if "topics" in p.payload and isinstance(p.payload["topics"], list):
            all_topics.extend(p.payload["topics"])
        elif "topic" in p.payload and isinstance(p.payload["topic"], str):
            all_topics.append(p.payload["topic"])
    if len(points) < scroll_limit:
        break
    offset = points[-1].id  # continue from last point

unique_topics = sorted(set(all_topics))
print(f"Unique topics in collection ({len(unique_topics)}):")
for t in unique_topics:
    print("-", t)
print("\nTopic counts:", Counter(all_topics))

Unique topics in collection (4):
- AI
- Climate
- Other
- Space

Topic counts: Counter({'AI': 1037, 'Space': 786, 'Climate': 447, 'Other': 37})


In [10]:
for topic in unique_topics:
    print(f"\n=== Top 3 retrieved chunks for topic: {topic} ===")
    results = qdrant.query_points(
        collection_name=QDRANT_COLLECTION,
        query=query_vector,
        limit=3,
        query_filter=Filter(
            must=[FieldCondition(key="topics", match=MatchValue(value=topic))]
        ),
    )
    if hasattr(results, "points") and results.points:
        for p in results.points:
            text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
            print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
    else:
        print("No chunks retrieved. Check your Qdrant collection and query.")


=== Top 3 retrieved chunks for topic: AI ===
ID: b207ddd6-2e93-5d3f-8e6c-16f153794761, text snippet: similar to dark energy, resulted in an enormous and exponential expansion of the universe during its earliest stages. Such expansion is an essential f...
ID: 6621bd3c-ec6d-5cea-ba11-88dfa1838a85, text snippet: In physical cosmology and astronomy, dark energy is a proposed form of energy that affects the universe on the largest scales. Its primary effect is t...
ID: f47b28dc-d58b-5fa7-88c0-91247b5689a1, text snippet: ordinary matter.: III.A Further indications of mass-to-light ratio anomalies came from measurements of galaxy rotation curves. In 1939, H.W. Babcock r...

=== Top 3 retrieved chunks for topic: Climate ===
ID: 9a5fff89-c113-58fc-b77e-b4ded7afb0bf, text snippet: units of magnitudes per square arcsecond (mag/arcsec2; sometimes expressed as mag arcsec−2), which defines the brightness depth of the isophote. To il...
ID: 2a3a4813-50aa-5536-94d3-17566b2f2e24, text snippet: Way, as