# ScienceSage Sanity Check Notebook

This notebook verifies that:

1. Processed text chunks exist on disk
2. Their embeddings were stored in the Qdrant collection
3. Retrieval returns reasonable results for a sample query

You can run this before starting the Streamlit app.

In [2]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import json
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from config.config import OPENAI_API_KEY, QDRANT_URL, QDRANT_COLLECTION
from collections import Counter

2025-09-01 18:22:37.016 | INFO     | config.config:<module>:108 - Configuration loaded.


In [3]:
print(sys.executable)

/workspaces/ScienceSage/.venv/bin/python


In [4]:
client = OpenAI(api_key=OPENAI_API_KEY)
qdrant = QdrantClient(url=QDRANT_URL)

### 1️⃣ Check processed chunks

In [5]:
chunks_file = Path('../data/chunks/chunks.jsonl')
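# Read one JSON object per line; the context manager closes the file afterwards
with chunks_file.open('r', encoding='utf-8') as f:
    # Skip blank lines so json.loads does not fail on them
    chunks = [json.loads(line) for line in f if line.strip()]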
print(f"Total chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])

Total chunks: 24
Sample chunk:
{'id': 'nasa_causes_0_0_76450290dd5d', 'uuid': '66750ee4-ee46-56f8-be55-58352298f1e7', 'topics': ['AI', 'Renewable Energy & Climate Change', 'Ecosystem Interactions'], 'topic': 'AI', 'source': 'nasa_causes', 'chunk_index': 0, 'text': 'Takeaways - The greenhouse effect is essential to life on Earth, but human-made emissions in the atmosphere are trapping and slowing heat loss to space. - Five key greenhouse gases are carbon dioxide, nitrous oxide, methane, chlorofluorocarbons, and water vapor. - While the Sun has played a role in past climate changes, the evidence shows the current warming cannot be explained by the Sun. Increasing Greenhouses Gases Are Warming the Planet Scientists attribute the global warming trend observed since the mid-20th century to the human expansion of the "greenhouse effect"1 — warming that results when the atmosphere traps heat radiating from Earth toward space. Life on Earth depends on energy coming from the Sun. About half the
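
Printing one chunk shows its shape, but it does not confirm that every record is well formed. A minimal sketch of a structural check follows. The required keys are an assumption taken from the sample chunk above, and `find_invalid_chunks` is a hypothetical helper, not part of the project.

```python
# Keys assumed to be required, based on the sample chunk printed above.
REQUIRED_KEYS = {"id", "uuid", "topics", "text"}


def find_invalid_chunks(chunks):
    """Return (index, reason) pairs for chunks that look malformed."""
    problems = []
    for i, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif not str(chunk["text"]).strip():
            problems.append((i, "empty text"))
        elif not isinstance(chunk["topics"], list) or not chunk["topics"]:
            problems.append((i, "topics is not a non-empty list"))
    return problems


# Illustrative records only; in the notebook, pass the `chunks` list loaded above.
sample = [
    {"id": "a", "uuid": "u1", "topics": ["AI"], "text": "hello"},
    {"id": "b", "uuid": "u2", "topics": [], "text": "world"},
    {"id": "c", "text": "no uuid or topics"},
]
print(find_invalid_chunks(sample))
```

An empty list from `find_invalid_chunks(chunks)` would suggest that all records carry the fields the later cells rely on.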

### 2️⃣ Test embedding a query

In [6]:
query = "What is neuroplasticity?"
embedding_model = "text-embedding-3-small"

response = client.embeddings.create(model=embedding_model, input=query)
query_vector = response.data[0].embedding
print(f"Embedding length: {len(query_vector)}")

Embedding length: 1536
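
The length printed above matters because the query vector must have the same dimensionality as the vectors stored in the collection. The sketch below turns that into an explicit check. `check_embedding` is a hypothetical helper, and the default of 1536 is taken from the output above; a collection created with a different vector size would need a different value.

```python
import math

# Taken from the output above; adjust if the collection uses another size.
EXPECTED_DIM = 1536


def check_embedding(vector, expected_dim=EXPECTED_DIM):
    """Return a list of problems found in an embedding vector (empty if none)."""
    problems = []
    if len(vector) != expected_dim:
        problems.append(f"expected {expected_dim} dimensions, got {len(vector)}")
    if any(not math.isfinite(x) for x in vector):
        problems.append("vector contains NaN or infinite values")
    if vector and all(x == 0 for x in vector):
        problems.append("vector is all zeros")
    return problems


# Illustrative vectors only; in the notebook, pass `query_vector` instead.
print(check_embedding([0.1] * 1536))
print(check_embedding([0.1, 0.2, 0.3]))
```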


### 3️⃣ Test Qdrant similarity search

In [8]:
results = qdrant.query_points(
    collection_name=QDRANT_COLLECTION,
    query=query_vector,
    limit=3,
    query_filter=Filter(
        must=[FieldCondition(key="topics", match=MatchValue(value="Neuroplasticity"))]
    ),
)

print("Top 3 retrieved chunks:")
if hasattr(results, "points") and results.points:
    for p in results.points:
        text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
        print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
else:
    print("No chunks retrieved. Check your Qdrant collection and query.")

Top 3 retrieved chunks:
ID: e6867d3c-612a-5388-b977-21a1a08d6fcf, text snippet: Neuroplasticity, also known as neural plasticity or just plasticity, is the medium of neural networks in the brain to change through growth and reorga...
ID: 46334a06-8b98-5b39-8d34-9f896accc59b, text snippet: Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take a...
ID: 0c448268-7be0-5576-9dbb-fe2178551b95, text snippet: A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language...
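
Only the first of the three snippets above is about neuroplasticity; the other two cover reinforcement learning and large language models. A crude lexical check can make that visible without reading each snippet. This is a sketch only: `keyword_hit_rate` is a hypothetical helper, and keyword matching is a rough stand-in for judging relevance.

```python
def keyword_hit_rate(snippets, keyword):
    """Fraction of snippets that mention `keyword` (case-insensitive)."""
    if not snippets:
        return 0.0
    needle = keyword.lower()
    hits = sum(1 for s in snippets if needle in s.lower())
    return hits / len(snippets)


# Shortened copies of the three snippets printed above.
snippets = [
    "Neuroplasticity, also known as neural plasticity or just plasticity, ...",
    "Reinforcement learning (RL) is an interdisciplinary area of machine learning ...",
    "A large language model (LLM) is a language model trained with self-supervised ...",
]
rate = keyword_hit_rate(snippets, "plasticity")
print(f"Snippets mentioning the keyword: {rate:.0%}")
```

In the notebook, the snippets could be collected from `p.payload.get('text', '')` for each point in `results.points`.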


In [9]:
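# Reuse the `qdrant` client created above instead of rebinding `client`,
# which holds the OpenAI client (assumes QDRANT_URL points at this instance)
print(qdrant.get_collections())

collections=[CollectionDescription(name='scientific_concepts')]


In [10]:
# Exact count of points in the configured collection
qdrant.count(collection_name=QDRANT_COLLECTION, exact=True)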

CountResult(count=24)

In [12]:
all_topics = []
scroll_limit = 100  # adjust as needed for large collections
offset = None

while True:
    # scroll() returns (points, next_page_offset); the offset is None on the last page
    points, next_offset = qdrant.scroll(
        collection_name=QDRANT_COLLECTION,
        limit=scroll_limit,
        offset=offset,
        with_payload=True,
        with_vectors=False,
    )
    for p in points:
        payload = p.payload or {}
        # Handle both 'topic' (string) and 'topics' (list)
        if isinstance(payload.get("topics"), list):
            all_topics.extend(payload["topics"])
        elif isinstance(payload.get("topic"), str):
            all_topics.append(payload["topic"])
    if next_offset is None:
        break
    # Resume from the offset Qdrant returned; using the last point's ID would repeat that point
    offset = next_offset

unique_topics = sorted(set(all_topics))
print(f"Unique topics in collection ({len(unique_topics)}):")
for t in unique_topics:
    print("-", t)
print("\nTopic counts:", Counter(all_topics))

Unique topics in collection (5):
- AI
- Animal Adaptation
- Ecosystem Interactions
- Neuroplasticity
- Renewable Energy & Climate Change

Topic counts: Counter({'AI': 24, 'Renewable Energy & Climate Change': 21, 'Animal Adaptation': 16, 'Ecosystem Interactions': 11, 'Neuroplasticity': 7})
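
In the counts above, 'AI' appears 24 times, which equals the point count reported earlier, so a filter on that topic cannot narrow the results. Whether that tagging is intended is not stated here. The sketch below flags such topics; `topics_on_every_point` is a hypothetical helper, and the numbers are copied from the outputs above.

```python
from collections import Counter


def topics_on_every_point(topic_counts, total_points):
    """Return the topics attached to at least `total_points` points."""
    return sorted(t for t, n in topic_counts.items() if n >= total_points)


# Values copied from the "Topic counts" output and the collection count above.
counts = Counter({
    "AI": 24,
    "Renewable Energy & Climate Change": 21,
    "Animal Adaptation": 16,
    "Ecosystem Interactions": 11,
    "Neuroplasticity": 7,
})
print(topics_on_every_point(counts, total_points=24))  # ['AI']
```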


In [None]:
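# Note: every iteration reuses the "What is neuroplasticity?" query vector;
# only the topic filter changes.
for topic in unique_topics:
    print(f"\n=== Top 3 retrieved chunks for topic: {topic} ===")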
    results = qdrant.query_points(
        collection_name=QDRANT_COLLECTION,
        query=query_vector,
        limit=3,
        query_filter=Filter(
            must=[FieldCondition(key="topics", match=MatchValue(value=topic))]
        ),
    )
    if hasattr(results, "points") and results.points:
        for p in results.points:
            text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
            print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
    else:
        print("No chunks retrieved. Check your Qdrant collection and query.")


=== Top 3 retrieved chunks for topic: AI ===
No chunks retrieved. Check your Qdrant collection and query.

=== Top 3 retrieved chunks for topic: Animal Adaptation ===
No chunks retrieved. Check your Qdrant collection and query.

=== Top 3 retrieved chunks for topic: Ecosystem Interactions ===
No chunks retrieved. Check your Qdrant collection and query.

=== Top 3 retrieved chunks for topic: Neuroplasticity ===
No chunks retrieved. Check your Qdrant collection and query.

=== Top 3 retrieved chunks for topic: Renewable Energy & Climate Change ===
No chunks retrieved. Check your Qdrant collection and query.
