# ScienceSage Sanity Check Notebook

This notebook verifies that:

1. Text chunks exist
2. Qdrant embeddings were created
3. Retrieval returns reasonable results

You can run this before starting the Streamlit app.

In [12]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import json
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from app.config import OPENAI_API_KEY, QDRANT_URL, QDRANT_COLLECTION
from collections import Counter

In [2]:
print(sys.executable)

/workspaces/ScienceSage/.venv/bin/python


In [3]:
client = OpenAI(api_key=OPENAI_API_KEY)
qdrant = QdrantClient(url=QDRANT_URL)

### 1️⃣ Check processed chunks

In [4]:
chunks_file = Path('../data/chunks/chunks.jsonl')
chunks = [json.loads(line) for line in chunks_file.open('r', encoding='utf-8')]
print(f"Total chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])

Total chunks: 214
Sample chunk:
{'id': 'nasa_causes_30d637499af3', 'uuid': '17af4ab7-888c-56c2-9cc5-ec6d1ee4defc', 'topic': 'nasa', 'source': 'nasa_causes', 'chunk_index': 0, 'text': 'Takeaways - The greenhouse effect is essential to life on Earth, but human-made emissions in the atmosphere are trapping and slowing heat loss to space. - Five key greenhouse gases are carbon dioxide, nitrous oxide, methane, chlorofluorocarbons, and water vapor. - While the Sun has played a role in past climate changes, the evidence shows the current warming cannot be explained by the Sun. Increasing Greenhouses Gases Are Warming the Planet Scientists attribute the global warming trend observed since the mid-20th century to the human expansion of the "greenhouse effect"1 — warming that results when the atmosphere traps heat radiating from Earth toward space. Life on Earth depends on energy coming from the Sun. About half the light energy reaching Earth\'s atmosphere passes through the air and clouds to th

### 2️⃣ Test embedding a query

In [5]:
query = "What is neuroplasticity?"
embedding_model = "text-embedding-3-small"

response = client.embeddings.create(model=embedding_model, input=query)
query_vector = response.data[0].embedding
print(f"Embedding length: {len(query_vector)}")

Embedding length: 1536


### 3️⃣ Test Qdrant similarity search

In [6]:
results = qdrant.query_points(
    collection_name=QDRANT_COLLECTION,
    query=query_vector,
    limit=3,
    query_filter=Filter(
        must=[FieldCondition(key="topic", match=MatchValue(value="neuroplasticity"))]
    ),
)

print("Top 3 retrieved chunks:")
if hasattr(results, "points") and results.points:
    for p in results.points:
        text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
        print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
else:
    print("No chunks retrieved. Check your Qdrant collection and query.")

Top 3 retrieved chunks:
ID: 2b8442ef-57a5-5d2a-ac27-e7b0b50090e1, text snippet: Neuroplasticity Neuroplasticity, also known as neural plasticity or just plasticity, is the ability of neural networks in the brain to change through ...
ID: 948a065c-69a7-5be4-be4e-a78b2e7414cc, text snippet: Clinton Woosley. The experiment was based on observation of what occurred in the brain when one peripheral nerve was cut and subsequently regenerated....
ID: 57b1244a-bb34-55dd-98a4-2d0068dffe35, text snippet: Up until the 1970s, neuroscientists believed that the brain's structure and function was essentially fixed throughout adulthood.[23] While the brain w...


In [8]:
client = QdrantClient("http://localhost:6333")
print(client.get_collections())

collections=[CollectionDescription(name='scientific_concepts')]


In [10]:
client.count("scientific_concepts", exact=True)

CountResult(count=214)

In [13]:
all_topics = []
scroll_limit = 100  # adjust as needed for large collections
offset = None

while True:
    scroll_result = qdrant.scroll(
        collection_name=QDRANT_COLLECTION,
        limit=scroll_limit,
        offset=offset,
        with_payload=True,
        with_vectors=False,
    )
    points = scroll_result[0]
    if not points:
        break
    for p in points:
        topic = p.payload.get("topic")
        if topic:
            all_topics.append(topic)
    if len(points) < scroll_limit:
        break
    offset = points[-1].id  # continue from last point

unique_topics = sorted(set(all_topics))
print(f"Unique topics in collection ({len(unique_topics)}):")
for t in unique_topics:
    print("-", t)
print("\nTopic counts:", Counter(all_topics))

Unique topics in collection (9):
- animal
- climate
- decline
- large
- nasa
- neuroplasticity
- reinforcement
- retrieval
- transformer

Topic counts: Counter({'climate': 64, 'nasa': 32, 'large': 29, 'neuroplasticity': 26, 'transformer': 23, 'reinforcement': 17, 'decline': 10, 'animal': 9, 'retrieval': 6})


In [14]:
for topic in unique_topics:
    print(f"\n=== Top 3 retrieved chunks for topic: {topic} ===")
    results = qdrant.query_points(
        collection_name=QDRANT_COLLECTION,
        query=query_vector,
        limit=3,
        query_filter=Filter(
            must=[FieldCondition(key="topic", match=MatchValue(value=topic))]
        ),
    )
    if hasattr(results, "points") and results.points:
        for p in results.points:
            text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
            print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
    else:
        print("No chunks retrieved. Check your Qdrant collection and query.")


=== Top 3 retrieved chunks for topic: animal ===
ID: 361c104f-3ed7-5ef9-ae01-07a196999ff6, text snippet: Animal migration Animal migration is the relatively long-distance movement of individual animals, usually on a seasonal basis. It is the most common f...
ID: 33e03d94-29eb-5e76-946c-1741c42f35ca, text snippet: There is scope for further development of systems able to track small animals globally.[46] Radio-tracking tags can be fitted to insects, including dr...
ID: 32ccdec2-4d17-5ed7-8903-77108680700b, text snippet: Some species such as Pacific salmon migrate to reproduce; every year, they swim upstream to mate and then return to the ocean.[8] Temperature is a dri...

=== Top 3 retrieved chunks for topic: climate ===
ID: 5b2d9587-6083-54dc-8d8f-2c9cdfdc03be, text snippet: and cities with green garden spaces. These can reduce heat stress and food insecurity for low-income neighbourhoods.[30]: 800 Ecosystem-based adaptati...
ID: 51b6f439-4f9a-58cd-b208-6c53a5efb4a1, text snippet: 10 