# ScienceSage Sanity Check Notebook

This notebook runs five checks:

1. The Python environment is the expected one
2. Processed text chunks exist
3. A query can be embedded
4. Qdrant is reachable and contains data
5. Retrieval returns reasonable results

You can run this before starting the Streamlit app.

In [1]:
import sys
from pathlib import Path

project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import json
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny, MatchValue
from sciencesage.config import QDRANT_URL, QDRANT_COLLECTION
from collections import Counter
from sentence_transformers import SentenceTransformer

2025-09-28 16:07:34.727 | INFO     | sciencesage.config:<module>:118 - Configuration loaded.


### 1️⃣ Environment Check

In [2]:
print(sys.executable)

/workspaces/ScienceSage/.venv/bin/python


In [3]:
print(sys.version)

3.12.1 (main, Jul 10 2025, 11:57:50) [GCC 13.3.0]


In [4]:
qdrant = QdrantClient(url=QDRANT_URL)

### 2️⃣ Check that processed chunks exist

In [5]:
chunks_file = Path('../data/processed/chunks.jsonl')
# Use a context manager so the file handle is closed after reading.
with chunks_file.open('r', encoding='utf-8') as f:
    chunks = [json.loads(line) for line in f if line.strip()]
print(f"Total chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])

Total chunks: 2713
Sample chunk:
{'uuid': '90a4913a-d951-5c8d-ac1b-ca7ff747285b', 'text': 'Discovery and exploration of the Solar System is observation, visitation, and increase in knowledge and understanding of Earth\'s "cosmic neighborhood". This includes the Sun, Earth and the Moon, the major planets Mercury, Venus, Mars, Jupiter, Saturn, Uranus, and Neptune, their satellites, as well as smaller bodies including comets, asteroids, and dust.\nIn ancient and medieval times, only objects visible to the naked eye—the Sun, the Moon, the five classical planets, and comets, along with phenomena now known to take place in Earth\'s atmosphere, like meteors and aurorae—were known. Ancient astronomers were able to make geometric observations with various instruments. The collection of precise observations in the early modern period and the invention of the telescope helped determine the overall structure of the Solar System.', 'title': 'Discovery and exploration of the Solar System', 'source_u
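
Printing one chunk shows that the file is readable, but not that every line is well formed. As a rough sketch, the helper below walks the whole JSONL file and reports lines that fail to parse or lack expected fields. The name `validate_chunks` and the `REQUIRED_KEYS` set are assumptions for illustration (the keys are a subset of the payload keys printed elsewhere in this notebook), not part of `sciencesage`.

```python
import json
from pathlib import Path

# Assumed minimum set of fields; adjust to match your pipeline.
REQUIRED_KEYS = {"uuid", "text", "title", "source_url", "topic"}


def validate_chunks(path, required_keys=REQUIRED_KEYS):
    """Return (total, problems); problems is a list of (line_number, reason)."""
    problems = []
    total = 0
    with Path(path).open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = sorted(set(required_keys) - set(record))
            if missing:
                problems.append((line_number, f"missing keys: {missing}"))
            elif not str(record.get("text", "")).strip():
                problems.append((line_number, "empty text"))
    return total, problems
```

In this notebook it could be called as `total, problems = validate_chunks(chunks_file)`, followed by printing the first few entries of `problems` if the list is not empty.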

### 3️⃣ Test embedding a query

In [6]:
query = "What missions have explored Mars?"
embedding_model = "all-MiniLM-L6-v2"
model = SentenceTransformer(embedding_model)
query_vector = model.encode(query).tolist()

print(f"Embedding length: {len(query_vector)}")
print(f"First 5 values: {query_vector[:5]}")
print(f"Type: {type(query_vector)}")

Embedding length: 384
First 5 values: [0.07459472864866257, -0.018688632175326347, -0.017338737845420837, 0.022701118141412735, 0.04478156939148903]
Type: <class 'list'>
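
The printed length and sample values can also be checked programmatically. The sketch below is one possible way to do that: it flags a wrong dimension, non-finite values, or an all-zero vector. `check_embedding` is a hypothetical helper name, and the default of 384 is simply the embedding length printed above.

```python
import math


def check_embedding(vector, expected_dim=384):
    """Return a list of problems found in an embedding (empty list if none)."""
    problems = []
    if len(vector) != expected_dim:
        problems.append(f"expected {expected_dim} dimensions, got {len(vector)}")
    if any(not math.isfinite(v) for v in vector):
        problems.append("contains NaN or infinite values")
    elif all(v == 0.0 for v in vector):
        problems.append("all values are zero")
    return problems
```

Calling `check_embedding(query_vector)` should return an empty list when the model loaded correctly; anything else is worth investigating before querying Qdrant.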


### 4️⃣ Qdrant connection and data check

In [7]:
qdrant = QdrantClient(url=QDRANT_URL)
print(qdrant.get_collections())
points, _ = qdrant.scroll(collection_name=QDRANT_COLLECTION, limit=2, with_payload=True)
for p in points:
    print(p.payload)

# Count points in the configured collection (exact=True forces a full count, not an estimate)
print(f"{QDRANT_COLLECTION} count:", qdrant.count(collection_name=QDRANT_COLLECTION, exact=True).count)

collections=[CollectionDescription(name='scientific_concepts')]
{'uuid': '001ee393-479a-5352-ad9f-4e515787bc0d', 'text': 'Members\nILEWG Executive Director:  Prof. Bernard Foing (ILEWG Past-President, 1998 - 2000)\nILEWG Vice-presidents: Prof. Tai Sik Lee (2016 - current), Prof. Jacques Blamont (2010 - 2016), Dr. Simonetta di Pippo (2006 – 2008), Dr Robert Richards (2005 - 2007)\nILEWG Past-Presidents : Dr. Michael Wargo (2008 - 2010), Prof. Wu Ji (2006 - 2008), Prof. Narendra Bhandari (2004 - 2006), Prof Carle Pieters (2002 – 2004), Prof Mike Duke (2000-2002), Prof Bernard Foing (1998 – 2000), Acad. Erik Galimov (1996 – 1998), Dr Hitoshi Mizutani', 'title': 'International Lunar Exploration Working Group', 'source_url': 'https://en.wikipedia.org/wiki/International_Lunar_Exploration_Working_Group', 'categories': ['Category:Exploration of the Moon'], 'topic': 'moon', 'images': ['https://upload.wikimedia.org/wikipedia/commons/e/e8/Crystal_Clear_app_kedit.svg', 'https://upload.wikimedia.or

### 5️⃣ Test Qdrant similarity search

In [8]:
results = qdrant.query_points(
    collection_name=QDRANT_COLLECTION,
    query=query_vector,
    limit=3,
)

print("Top 3 retrieved chunks about Mars:")
if hasattr(results, "points") and results.points:
    for p in results.points:
        text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
        print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
else:
    print("No chunks retrieved. Check your Qdrant collection and query.")

Top 3 retrieved chunks about Mars:
ID: b788c0ee-1a3e-5cf5-a511-f0c737fd45f7, text snippet: Missions
List
Timeline
See also
Exploration of Mars
List of missions to Mars
List of NASA missions References
Notes...
ID: 0613402e-4202-5e93-b387-66fe35f73ab3, text snippet: Overview of missions
The following entails a brief overview of previous missions to Mars, oriented towards orbiters and flybys; see also Mars landing ...
ID: c8d3a3bf-e790-549a-8276-7284453a7bbc, text snippet: The planet Mars has been explored remotely by spacecraft. Probes sent from Earth, beginning in the late 20th century, have yielded a large increase in...
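
Reading the snippets by eye works for a single query. As a crude automated stand-in for "reasonable results", the sketch below counts how many retrieved points mention an expected keyword in their payload text. `keyword_hits` is a hypothetical helper, and a keyword match is only a rough proxy for relevance, not a real evaluation metric.

```python
def keyword_hits(points, keyword, field="text"):
    """Return the ids of points whose payload[field] contains keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for p in points:
        # Points without a payload are treated as having an empty one.
        payload = getattr(p, "payload", None) or {}
        if needle in str(payload.get(field, "")).lower():
            hits.append(getattr(p, "id", None))
    return hits
```

With the query above, `keyword_hits(results.points, "mars")` can be compared against `len(results.points)`; a result with no hits at all would suggest a problem with the collection or the embedding model.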


### Count topics in the first 1,000 points

In [9]:
all_points, _ = qdrant.scroll(collection_name=QDRANT_COLLECTION, limit=1000, with_payload=True)
topics = [p.payload.get("topic", "Unknown") for p in all_points if hasattr(p, "payload")]
topic_counts = Counter(topics)
print("Topics in collection:")
for topic, count in topic_counts.items():
    print(f"{topic}: {count}")

Topics in collection:
moon: 399
mars: 349
planets: 31
other: 175
space exploration: 37
animals in space: 9
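
The counts above sum to exactly 1,000, which is the `limit` passed to `scroll`, so they describe only the first page of results. `scroll` also returns a next-page offset (discarded as `_` in the cell above) that can be passed back through the `offset` argument to continue. The sketch below pages through a collection until that offset is `None`. `count_topics` is a hypothetical helper name, and it accepts any client object exposing a compatible `scroll` method.

```python
from collections import Counter


def count_topics(client, collection_name, page_size=1000):
    """Count the 'topic' payload value across every point in a collection."""
    counts = Counter()
    offset = None  # None asks for the first page
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        for p in points:
            payload = getattr(p, "payload", None) or {}
            counts[payload.get("topic", "Unknown")] += 1
        if offset is None:  # no further pages
            break
    return counts
```

In this notebook it could be called as `count_topics(qdrant, QDRANT_COLLECTION)`; the total of the returned counts should then match the exact point count reported by `qdrant.count`.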


### Checking the payload of the first few points

In [10]:
points, _ = qdrant.scroll(collection_name=QDRANT_COLLECTION, limit=5, with_payload=True)
for i, p in enumerate(points):
    if hasattr(p, "payload"):
        print(f"Point {i} payload keys: {list(p.payload.keys())}")
        print(f"Sample payload: {p.payload}\n")

Point 0 payload keys: ['uuid', 'text', 'title', 'source_url', 'categories', 'topic', 'images', 'summary', 'chunk_index', 'char_start', 'char_end', 'created_at', 'embedding']
Sample payload: {'uuid': '001ee393-479a-5352-ad9f-4e515787bc0d', 'text': 'Members\nILEWG Executive Director:  Prof. Bernard Foing (ILEWG Past-President, 1998 - 2000)\nILEWG Vice-presidents: Prof. Tai Sik Lee (2016 - current), Prof. Jacques Blamont (2010 - 2016), Dr. Simonetta di Pippo (2006 – 2008), Dr Robert Richards (2005 - 2007)\nILEWG Past-Presidents : Dr. Michael Wargo (2008 - 2010), Prof. Wu Ji (2006 - 2008), Prof. Narendra Bhandari (2004 - 2006), Prof Carle Pieters (2002 – 2004), Prof Mike Duke (2000-2002), Prof Bernard Foing (1998 – 2000), Acad. Erik Galimov (1996 – 1998), Dr Hitoshi Mizutani', 'title': 'International Lunar Exploration Working Group', 'source_url': 'https://en.wikipedia.org/wiki/International_Lunar_Exploration_Working_Group', 'categories': ['Category:Exploration of the Moon'], 'topic': 'moo