# ScienceSage Sanity Check Notebook

This notebook is a sanity check and data-validation tool for the ScienceSage project. It inspects each stage of the data pipeline and the retrieval system so you can confirm that every stage produced the expected output. The steps run in the order listed below:

**Notebook Steps & Logic:**

| Step Number | Short Description                | Details                                                                                                     | 
|-------------|----------------------------------|-------------------------------------------------------------------------------------------------------------|
| 1           | Environment Check                | Prints Python executable and version to confirm the runtime environment.                                    |
| 2           | Downloaded Files                 | Counts and lists the types of files in the raw directory.                                                   |
| 3           | Meta Information                 | Shows the fields present in a sample meta.json file.                                                        |
| 4           | Processed Chunks                 | Loads and prints the number of processed text chunks and a sample chunk.                                    |
| 5           | Embeddings Check                 | Reads and displays the first few rows of the embeddings parquet file.                                       |
| 6           | Test Embedding a Query           | Embeds a sample query and prints embedding details.                                                         |
| 7           | Qdrant Connection and Data Check | Connects to Qdrant, lists collections, prints a couple of point payloads, and counts points in a collection.|
| 8           | Get All Points and Count Topics  | Retrieves up to 1000 points, inspects the first 5 payloads, and counts topics across the retrieved points.  |
| 9           | Test Qdrant Similarity Search    | Runs a similarity search in Qdrant and prints the top results with scores and payloads.                     |
| 10          | Ground Truth Data                | Prints the first few records from the ground truth dataset.                                                 |
| 11          | Evaluation Results               | Prints the first few records from the evaluation results.                                                   |
| 12          | LLM Evaluation Results           | Prints the first few records from the LLM evaluation results.                                               |
| 13          | Metrics Summary                  | Loads all evaluation metrics into a DataFrame and displays the first few rows as a table.                   |
| 14          | Feedback Data                    | Prints the first few records from the feedback file.                                                        |
| 15          | Compare Field Names Across Files | Collects and displays the field names from all major data files in a table for schema validation.           |

You can run it before starting the Streamlit app to catch data or retrieval problems early.

In [1]:
import sys
import json
from pathlib import Path
from collections import Counter

import pandas as pd
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny, MatchValue
from sciencesage.config import QDRANT_URL, QDRANT_COLLECTION
from sentence_transformers import SentenceTransformer

2025-10-05 02:30:40.627 | INFO     | sciencesage.config:<module>:168 - Configuration loaded.


### 1. Environment Check

In [2]:
print(sys.executable)

/workspaces/ScienceSage/.venv/bin/python


In [3]:
print(sys.version)

3.12.1 (main, Jul 10 2025, 11:57:50) [GCC 13.3.0]


In [4]:
qdrant = QdrantClient(url=QDRANT_URL)

### 2. Downloaded Files

In [5]:
raw_dir = Path("../data/raw")
file_types = [f.suffix for f in raw_dir.iterdir() if f.is_file()]
type_counts = Counter(file_types)

print("Number of files downloaded in data/raw by type:")
for ext, count in type_counts.items():
    print(f"{ext or '[no extension]'}: {count}")

Number of files downloaded in data/raw by type:
.html: 142
.txt: 142
.json: 142
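
Equal totals per extension suggest one `.html`, one `.txt`, and one `.json` file per article, but matching totals alone do not prove that every article has all three. The sketch below checks this per article. It assumes the three files share a base name and that the metadata file ends in `.meta.json` (as `wikipedia_SpaceIL.meta.json` in step 3 suggests); the `.html`/`.txt` naming and the helper name `find_incomplete_articles` are assumptions, not part of the project.

```python
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <base>.html, <base>.txt, <base>.meta.json
EXPECTED_SUFFIXES = (".meta.json", ".html", ".txt")


def find_incomplete_articles(raw_dir):
    """Map each article base name to the expected suffixes it is missing."""
    seen = defaultdict(set)
    for path in Path(raw_dir).iterdir():
        if not path.is_file():
            continue
        for suffix in EXPECTED_SUFFIXES:
            if path.name.endswith(suffix):
                # Strip the matched suffix to recover the shared base name.
                seen[path.name[: -len(suffix)]].add(suffix)
                break
    expected = set(EXPECTED_SUFFIXES)
    return {
        base: sorted(expected - found)
        for base, found in seen.items()
        if found != expected
    }
```

Calling `find_incomplete_articles("../data/raw")` would return an empty dict when every article has all three files.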


### 3. Meta Information

In [6]:
meta_files = list(Path("../data/raw").glob("*.meta.json"))
if not meta_files:
    print("No meta.json files found.")
else:
    sample_file = meta_files[0]
    with open(sample_file, "r") as f:
        meta = json.load(f)
    print(f"Sample meta.json file: {sample_file.name}")
    print("Fields:", list(meta.keys()))

Sample meta.json file: wikipedia_SpaceIL.meta.json
Fields: ['title', 'fullurl', 'categories', 'summary', 'images']
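
The cell above inspects a single file, so a metadata file with missing or extra fields elsewhere in the directory would go unnoticed. A minimal sketch of a directory-wide check follows; the helper name `meta_field_sets` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def meta_field_sets(raw_dir):
    """Count how many *.meta.json files use each distinct set of top-level keys."""
    counts = Counter()
    for path in Path(raw_dir).glob("*.meta.json"):
        with open(path, "r", encoding="utf-8") as f:
            # Sort the keys so that key order does not create spurious variants.
            counts[tuple(sorted(json.load(f).keys()))] += 1
    return counts
```

If `meta_field_sets("../data/raw")` returns more than one entry, the less common key sets point to files worth inspecting.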


### 4. Processed Chunks

In [7]:
chunks_file = Path('../data/processed/chunks.jsonl')
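# Use a context manager so the file handle is closed, and skip blank lines
with chunks_file.open('r', encoding='utf-8') as f:
    chunks = [json.loads(line) for line in f if line.strip()]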
print(f"Total chunks: {len(chunks)}")
print("Sample chunk:")
print(chunks[0])

Total chunks: 2713
Sample chunk:
{'chunk_id': '90a4913a-d951-5c8d-ac1b-ca7ff747285b', 'text': 'Discovery and exploration of the Solar System is observation, visitation, and increase in knowledge and understanding of Earth\'s "cosmic neighborhood". This includes the Sun, Earth and the Moon, the major planets Mercury, Venus, Mars, Jupiter, Saturn, Uranus, and Neptune, their satellites, as well as smaller bodies including comets, asteroids, and dust.\nIn ancient and medieval times, only objects visible to the naked eye—the Sun, the Moon, the five classical planets, and comets, along with phenomena now known to take place in Earth\'s atmosphere, like meteors and aurorae—were known. Ancient astronomers were able to make geometric observations with various instruments. The collection of precise observations in the early modern period and the invention of the telescope helped determine the overall structure of the Solar System.', 'title': 'Discovery and exploration of the Solar System', 'sour
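
Beyond the total count, two cheap checks on the loaded chunks are whether any `chunk_id` repeats and whether any chunk has empty text. The following is a sketch under the assumption that each record is a dict with `chunk_id` and `text` keys, as in the sample above; `check_chunks` is a hypothetical helper.

```python
from collections import Counter


def check_chunks(chunks):
    """Return chunk_ids that occur more than once and ids of chunks with blank text."""
    id_counts = Counter(c.get("chunk_id") for c in chunks)
    duplicate_ids = [cid for cid, n in id_counts.items() if n > 1]
    empty_text_ids = [
        c.get("chunk_id") for c in chunks if not (c.get("text") or "").strip()
    ]
    return {"duplicate_ids": duplicate_ids, "empty_text_ids": empty_text_ids}
```

With the notebook's variables this could be called as `check_chunks(chunks)`; both lists should be empty for a healthy file.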

### 5. Embeddings Check

In [8]:
embeddings_path = "../data/embeddings/embeddings.parquet"
embed_df = pd.read_parquet(embeddings_path)
print(embed_df.head())

                               chunk_id  \
0  08629607-a83c-540c-84c6-6a4027fbda6e   
1  9c310f61-19db-5455-a0de-7ee73caa3369   
2  159b3c27-4cea-5045-948d-e48bcb56444e   
3  d56cb574-eee7-598e-a8e5-61531f7def3e   
4  3bc8f500-e9b0-5f75-8731-3c8f1fc89f3c   

                                                text  \
0  Discovery and exploration of the Solar System ...   
1  Telescopic observations resulted in the discov...   
2  Observations of Solar System bodies with other...   
3  Pre-telescope\nThe first humans had limited un...   
4  Many associated the classical planets (star-li...   

                                           title  \
0  Discovery and exploration of the Solar System   
1  Discovery and exploration of the Solar System   
2  Discovery and exploration of the Solar System   
3  Discovery and exploration of the Solar System   
4  Discovery and exploration of the Solar System   

                                          source_url  \
0  https://en.wikipedia.org/wiki/Di
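
In the outputs shown, the first chunk in `chunks.jsonl` and the first row of the parquet file appear to start with the same text but carry different `chunk_id` values, which may mean the two files come from different pipeline runs. A small sketch for comparing the id sets of the two files is given below; `compare_ids` is a hypothetical helper that works on any two iterables of ids.

```python
def compare_ids(chunk_ids, embedded_ids):
    """Report ids present in one collection of ids but not in the other."""
    chunk_set = set(chunk_ids)
    embedded_set = set(embedded_ids)
    return {
        # Chunks that have no embedding row
        "missing_embeddings": sorted(chunk_set - embedded_set),
        # Embedding rows whose chunk no longer exists
        "orphan_embeddings": sorted(embedded_set - chunk_set),
    }
```

With the notebook's variables, a call might look like `compare_ids((c["chunk_id"] for c in chunks), embed_df["chunk_id"])`.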

### 6. Test Embedding a Query

In [9]:
query = "What missions have explored Mars?"
embedding_model = "all-MiniLM-L6-v2"
model = SentenceTransformer(embedding_model)
query_vector = model.encode(query).tolist()

print(f"Embedding length: {len(query_vector)}")
print(f"First 5 values: {query_vector[:5]}")
print(f"Type: {type(query_vector)}")

Embedding length: 384
First 5 values: [0.07459472864866257, -0.018688632175326347, -0.017338737845420837, 0.022701118141412735, 0.04478156939148903]
Type: <class 'list'>


### 7. Qdrant Connection and Data Check

In [10]:
qdrant = QdrantClient(url=QDRANT_URL)
print(qdrant.get_collections())
points, _ = qdrant.scroll(collection_name=QDRANT_COLLECTION, limit=2, with_payload=True)
for p in points:
    print(p.payload)

# Count points in the configured collection (avoids hard-coding its name)
print(f"{QDRANT_COLLECTION} count:", qdrant.count(QDRANT_COLLECTION, exact=True).count)

collections=[CollectionDescription(name='scientific_concepts')]
{'chunk_id': '0012728c-8f67-534b-b6b4-ae4e128dbc39', 'text': "Deep Space Transport LLC is a joint venture that is set to provide launch services for the Space Launch System rocket. The joint venture consists of Boeing, the prime contractor for the Space Launch System core stage and the Exploration Upper Stage that will be used on Space Launch System missions, and Northrop Grumman, the prime contractor for the Space Launch System's solid rocket boosters.\nwill achieve significant cost savings by shifting procurement of future Space Launch System rockets to a commercial services contract. \nDeep Space Transport LLC would be responsible for producing hardware and services for up to 10 Artemis launches beginning with the Artemis V mission, and up to 10 launches for other NASA missions. NASA expects to procure at least one flight per year to the Moon or other deep-space destinations.", 'title': 'Deep Space Transport LLC', 'sour

### 8. Get All Points and Count Topics

In [11]:
points, _ = qdrant.scroll(collection_name=QDRANT_COLLECTION, limit=1000, with_payload=True)

# Inspect the first 5 payloads
for i, p in enumerate(points[:5]):
    if hasattr(p, "payload"):
        print(f"Point {i} payload keys: {list(p.payload.keys())}")
        print(f"Sample payload: {p.payload}\n")

# Count topics in all retrieved points
topics = [p.payload.get("topic", "Unknown") for p in points if hasattr(p, "payload")]
topic_counts = Counter(topics)
print("Topics in collection:")
for topic, count in topic_counts.items():
    print(f"{topic}: {count}")

Point 0 payload keys: ['chunk_id', 'text', 'title', 'source_url', 'categories', 'topic', 'images', 'summary', 'chunk_index', 'char_start', 'char_end', 'created_at', 'embedding']
Sample payload: {'chunk_id': '0012728c-8f67-534b-b6b4-ae4e128dbc39', 'text': "Deep Space Transport LLC is a joint venture that is set to provide launch services for the Space Launch System rocket. The joint venture consists of Boeing, the prime contractor for the Space Launch System core stage and the Exploration Upper Stage that will be used on Space Launch System missions, and Northrop Grumman, the prime contractor for the Space Launch System's solid rocket boosters.\nwill achieve significant cost savings by shifting procurement of future Space Launch System rockets to a commercial services contract. \nDeep Space Transport LLC would be responsible for producing hardware and services for up to 10 Artemis launches beginning with the Artemis V mission, and up to 10 launches for other NASA missions. NASA expects 
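
The scroll call above uses `limit=1000`, while step 4 reported 2713 chunks, so if the collection holds every chunk the topic counts cover only part of it. `scroll` returns a next-page offset alongside the points, which can be fed back in to page through the whole collection. A minimal sketch, with `scroll_all` and `batch_size` as hypothetical names:

```python
def scroll_all(client, collection_name, batch_size=256):
    """Yield every point in a collection by following scroll offsets."""
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=offset,
            with_payload=True,
        )
        yield from points
        # A next-page offset of None means the last page has been returned.
        if offset is None:
            break
```

Topic counts over the full collection could then be computed with `Counter((p.payload or {}).get("topic", "Unknown") for p in scroll_all(qdrant, QDRANT_COLLECTION))`.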

### 9. Test Qdrant Similarity Search

In [12]:
results = qdrant.query_points(
    collection_name=QDRANT_COLLECTION,
    query=query_vector,
    limit=3,
)

print("Top 3 retrieved chunks about Mars:")
if hasattr(results, "points") and results.points:
    for p in results.points:
        text = p.payload.get('text', '') if hasattr(p, 'payload') else ''
        print(f"ID: {getattr(p, 'id', 'N/A')}, text snippet: {text[:150]}...")
        print("Score:", p.score)
        print("Payload:", p.payload)
else:
    print("No chunks retrieved. Check your Qdrant collection and query.")

Top 3 retrieved chunks about Mars:
ID: 524b7a57-0656-5964-8160-2b4b9ce4d16f, text snippet: Missions
List
Timeline
See also
Exploration of Mars
List of missions to Mars
List of NASA missions References
Notes...
Score: 0.72573817
Payload: {'chunk_id': '524b7a57-0656-5964-8160-2b4b9ce4d16f', 'text': 'Missions\nList\nTimeline\nSee also\nExploration of Mars\nList of missions to Mars\nList of NASA missions References\nNotes', 'title': 'Mars Exploration Program', 'source_url': 'https://en.wikipedia.org/wiki/Mars_Exploration_Program', 'categories': ['Category:Exploration of Mars', 'Category:Mars Exploration Program', 'Category:NASA programs'], 'topic': 'mars', 'images': ['https://upload.wikimedia.org/wikipedia/commons/5/58/2001_Mars_Odyssey_-_mars-odyssey-logo-sm.png', 'https://upload.wikimedia.org/wikipedia/commons/4/4e/Entry.jpg', 'https://upload.wikimedia.org/wikipedia/commons/b/b5/InSight_Mission_Logo.svg', 'https://upload.wikimedia.org/wikipedia/commons/1/18/M98patch.png', 'https://upload
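
The `Score` values above are similarity scores returned by Qdrant. Assuming the collection is configured with cosine distance (the notebook does not show the collection configuration), a score can be reproduced from two raw vectors as in this sketch; `cosine_similarity` is an illustrative helper, not a project function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for zero vectors; treat as no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```

Note that the top hit here is a short navigation-style chunk ("Missions", "List", "Timeline"), so a high score does not by itself indicate useful context.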

### 10. Ground Truth Data

In [13]:
with open("../data/ground_truth/ground_truth_dataset.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break

{'chunk_id': '451501cf-51c2-5954-abab-35ff573a201d', 'topic': 'other', 'text': 'Pioneer Venus\nIn 1978, NASA sent two Pioneer spacecraft to Venus. The Pioneer mission consisted of two components, launched separately: an orbiter and a multiprobe. The Pioneer Venus Multiprobe carried one large and three small atmospheric probes. The large probe was released on November 16, 1978, and the three small probes on November 20. All four probes entered the Venusian atmosphere on December 9, followed by the delivery vehicle. Although not expected to survive the descent through the atmosphere, one probe continued to operate for 45 minutes after reaching the surface. The Pioneer Venus Orbiter was inserted into an elliptical orbit around Venus on December 4, 1978. It carried 17 experiments and operated until the fuel used to maintain its orbit was exhausted and atmospheric entry destroyed the spacecraft in August 1992.', 'level': 'Middle School', 'question': 'What year did NASA send the Pioneer spac

### 11. Evaluation Results

In [14]:
with open("../data/eval/eval_results.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break

{'query': 'What year did NASA send the Pioneer spacecraft to Venus?', 'expected_answer': '1978', 'retrieved_chunks': ['5c2ed7e7-af71-5401-8ae1-59b0b1af0b9a', 'f85242a8-b3d5-50d9-a3ad-4ca26ff918cf', '33196586-1434-5ca1-af87-079567be1b82', '6b325b28-7197-53ea-ae46-74c152b2bb6d', 'b0430a88-a97e-57e4-be0d-e090c005c4f8', 'accbfd16-f797-5e51-9207-a9bc0ef5541f', 'fc885d62-c05e-50ab-bc36-ce64bc3eab93', '621a1b61-5148-5f5a-bfba-95f54db5e158', 'c9560bf2-15ee-5356-baff-8f633d9f5ed1', '9a978ee6-bb3e-5a02-b5b5-b68fa45a2611'], 'retrieved_context': ['Pioneer Venus\nIn 1978, NASA sent two Pioneer spacecraft to Venus. The Pioneer mission consisted of two components, launched separately: an orbiter and a multiprobe. The Pioneer Venus Multiprobe carried one large and three small atmospheric probes. The large probe was released on November 16, 1978, and the three small probes on November 20. All four probes entered the Venusian atmosphere on December 9, followed by the delivery vehicle. Although not expec

### 12. LLM Evaluation Results

In [15]:
with open("../data/eval/llm_eval.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 4:
            break

{'query': 'What year did NASA send the Pioneer spacecraft to Venus?', 'expected_answer': '1978', 'retrieved_answer': 'NASA sent the Pioneer spacecraft to Venus in 1978. They launched two Pioneer spacecraft as part of the Pioneer Venus mission, which included an orbiter and a multiprobe [Source: https://en.wikipedia.org/wiki/Observations_and_explorations_of_Venus, Chunk: 5c2ed7e7-af71-5401-8ae1-59b0b1af0b9a]. \n\nReferences:\n- https://en.wikipedia.org/wiki/Observations_and_explorations_of_Venus', 'exact_match': 0.0, 'retrieved_chunks': ['5c2ed7e7-af71-5401-8ae1-59b0b1af0b9a', 'f85242a8-b3d5-50d9-a3ad-4ca26ff918cf', '33196586-1434-5ca1-af87-079567be1b82', '6b325b28-7197-53ea-ae46-74c152b2bb6d', 'b0430a88-a97e-57e4-be0d-e090c005c4f8', 'accbfd16-f797-5e51-9207-a9bc0ef5541f', 'fc885d62-c05e-50ab-bc36-ce64bc3eab93', '621a1b61-5148-5f5a-bfba-95f54db5e158', 'c9560bf2-15ee-5356-baff-8f633d9f5ed1', '9a978ee6-bb3e-5a02-b5b5-b68fa45a2611'], 'retrieved_context': ['Pioneer Venus\nIn 1978, NASA sent
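
The record above has `exact_match` of 0.0 even though the generated answer contains the expected answer `1978`. The notebook does not show how `exact_match` is computed, so the following is only a sketch of the distinction between a strict comparison and a more lenient containment check; `normalize`, `exact_match`, and `contains_answer` are hypothetical helpers.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(expected, answer):
    """1.0 only when the normalized strings are identical."""
    return float(normalize(expected) == normalize(answer))


def contains_answer(expected, answer):
    """1.0 when the expected answer appears as whole tokens inside the answer."""
    needle = normalize(expected)
    if not needle:
        return 0.0
    # Pad with spaces so that '1978' does not match inside '19780'.
    return float(f" {needle} " in f" {normalize(answer)} ")
```

A strict comparison penalizes full-sentence answers, so a containment-style score can be a useful second signal for short factual answers.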

### 13. Metrics Summary

In [16]:
# Read all metrics from eval_results.jsonl
metrics = []
with open("../data/eval/eval_results.jsonl") as f:
    for line in f:
        metrics.append(json.loads(line))

# Convert to DataFrame and display as table
if metrics:
    df = pd.DataFrame(metrics)
    display(df.head())
else:
    print("No metrics found in eval_results.jsonl")

|   | query | expected_answer | retrieved_chunks | retrieved_context | ground_truth_chunks | precision_at_k | recall_at_k | reciprocal_rank | ndcg_at_k | topic | level | metadata |
|---|-------|-----------------|------------------|-------------------|---------------------|----------------|-------------|-----------------|-----------|-------|-------|----------|
| 0 | What year did NASA send the Pioneer spacecraft... | 1978 | [5c2ed7e7-af71-5401-8ae1-59b0b1af0b9a, f85242a... | [Pioneer Venus\nIn 1978, NASA sent two Pioneer... | [451501cf-51c2-5954-abab-35ff573a201d] | 0.0 | 0.0 | 0.0 | 0.0 | other | Middle School | {'source_text': 'Pioneer Venus In 1978, NASA s... |
| 1 | What were the two main components of the Pione... | An orbiter and a multiprobe | [5c2ed7e7-af71-5401-8ae1-59b0b1af0b9a, f85242a... | [Pioneer Venus\nIn 1978, NASA sent two Pioneer... | [451501cf-51c2-5954-abab-35ff573a201d] | 0.0 | 0.0 | 0.0 | 0.0 | other | College | {'source_text': 'Pioneer Venus In 1978, NASA s... |
| 2 | How long did one of the probes operate after r... | 45 minutes | [33196586-1434-5ca1-af87-079567be1b82, fc885d6... | [Observation by spacecraft\nThere have been nu... | [451501cf-51c2-5954-abab-35ff573a201d] | 0.0 | 0.0 | 0.0 | 0.0 | other | Advanced | {'source_text': 'Pioneer Venus In 1978, NASA s... |
| 3 | What is the axial tilt of Mars? | 25.19° | [fb00943e-7705-5b5d-86ab-0dfc9a35e078, df72a52... | [Temperature and seasons\nMars has an axial ti... | [88ae4373-5538-5c86-bf6d-e88b8a89ef3f] | 0.0 | 0.0 | 0.0 | 0.0 | mars | Middle School | {'source_text': 'Temperature and seasons Mars ... |
| 4 | How do the seasons on Mars compare to those on... | Mars has seasons much like Earth, but they las... | [fb00943e-7705-5b5d-86ab-0dfc9a35e078, 7061136... | [Temperature and seasons\nMars has an axial ti... | [88ae4373-5538-5c86-bf6d-e88b8a89ef3f] | 0.0 | 0.0 | 0.0 | 0.0 | mars | College | {'source_text': 'Temperature and seasons Mars ... |
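
Every row shown scores 0.0 on all four metrics. In row 0 the first retrieved context starts with the same text as the ground-truth chunk from step 10, yet the retrieved ids do not include the ground-truth id, which may point to an id mismatch between files rather than a retrieval failure. To recompute the id-based metrics by hand, the sketch below uses the standard textbook definitions; the project's own implementation is not shown in this notebook and may differ.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for cid in retrieved[:k] if cid in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0
```

For a row of the DataFrame, the inputs would be `row["retrieved_chunks"]` and `row["ground_truth_chunks"]`.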


### 14. Feedback Data

In [17]:
feedback_path = "../data/feedback/feedback.jsonl"
try:
    with open(feedback_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            print(json.loads(line))
            if i >= 4:
                break
except FileNotFoundError:
    print("feedback.jsonl not found.")

{'timestamp': '2025-10-02T18:57:54.524751+00:00', 'query': 'Who was the first human to travel into outer space, and in which spacecraft did they fly?', 'answer': 'The first human to travel into outer space was Yuri Gagarin, a Russian cosmonaut. He flew in a spacecraft called Vostok 1 on April 12, 1961. During his flight, he completed one orbit around the Earth, which took about 1 hour and 48 minutes [Source: https://en.wikipedia.org/wiki/Space_exploration, Chunk: 0].\n\n**References:**\n- https://en.wikipedia.org/wiki/Space_exploration,', 'topic': 'Category:Exploration of the Moon', 'level': 'Middle School', 'feedback': 'up', 'sources': None}
{'timestamp': '2025-10-02T19:01:59.618613+00:00', 'query': 'Who was the second human to travel into space?', 'answer': 'I don’t know based on the available information.', 'topic': 'Category:Exploration of the Moon', 'level': 'Middle School', 'feedback': 'down', 'sources': []}
{'timestamp': '2025-10-02T19:07:47.832871+00:00', 'query': 'Who was the 
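
Printing the first few records shows the shape of the data, but a tally of votes is often more useful as a health signal. A minimal sketch, assuming each record has `topic` and `feedback` keys as in the records above; `feedback_summary` is a hypothetical helper.

```python
from collections import Counter


def feedback_summary(records):
    """Count feedback votes per (topic, feedback) pair."""
    return Counter(
        (r.get("topic", "Unknown"), r.get("feedback", "Unknown")) for r in records
    )
```

The records could be loaded with `[json.loads(line) for line in f if line.strip()]` inside the same `with open(feedback_path, ...)` block used above.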

### 15. Compare Field Names Across Files

In [18]:
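def get_jsonl_fields(path):
    """Return the field names of the first record in a JSONL file, or [] on failure."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                # Only the first line is inspected; later records may differ
                return list(json.loads(line).keys())
    except Exception:
        # Missing file or malformed JSON: report no fields instead of raising
        return []
    # Reached only when the file exists but is empty
    return []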

In [19]:
summary = {}

# meta.json
meta_files = list(Path("../data/raw").glob("*.meta.json"))
if meta_files:
    with open(meta_files[0], "r") as f:
        summary["meta.json"] = list(json.load(f).keys())
else:
    summary["meta.json"] = []

# chunks.jsonl
summary["chunks.jsonl"] = get_jsonl_fields("../data/processed/chunks.jsonl")

# embeddings.parquet
embeddings_file = Path("../data/embeddings/embeddings.parquet")
if embeddings_file.exists():
    summary["embeddings.parquet"] = list(pd.read_parquet(embeddings_file, engine="pyarrow").columns)
else:
    summary["embeddings.parquet"] = []

# qdrant points
try:
    qdrant = QdrantClient(url=QDRANT_URL)
    points, _ = qdrant.scroll(collection_name=QDRANT_COLLECTION, limit=1, with_payload=True)
    if points and hasattr(points[0], "payload"):
        summary["qdrant points"] = list(points[0].payload.keys())
    else:
        summary["qdrant points"] = []
except Exception:
    summary["qdrant points"] = []

# ground_truth_dataset.jsonl
summary["ground_truth_dataset.jsonl"] = get_jsonl_fields("../data/ground_truth/ground_truth_dataset.jsonl")

# eval_results.jsonl
summary["eval_results.jsonl"] = get_jsonl_fields("../data/eval/eval_results.jsonl")

# llm_eval.jsonl
summary["llm_eval.jsonl"] = get_jsonl_fields("../data/eval/llm_eval.jsonl")

# feedback.jsonl
summary["feedback.jsonl"] = get_jsonl_fields("../data/feedback/feedback.jsonl")

# Display as table
df = pd.DataFrame({name: pd.Series(fields) for name, fields in summary.items()})
df = df.fillna("")  # Replace NaN with blank
display(df)

|   | meta.json  | chunks.jsonl | embeddings.parquet | qdrant points | ground_truth_dataset.jsonl | eval_results.jsonl  | llm_eval.jsonl      | feedback.jsonl |
|---|------------|--------------|--------------------|---------------|----------------------------|---------------------|---------------------|----------------|
| 0 | title      | chunk_id     | chunk_id           | chunk_id      | chunk_id                   | query               | query               | timestamp      |
| 1 | fullurl    | text         | text               | text          | topic                      | expected_answer     | expected_answer     | query          |
| 2 | categories | title        | title              | title         | text                       | retrieved_chunks    | retrieved_answer    | answer         |
| 3 | summary    | source_url   | source_url         | source_url    | level                      | retrieved_context   | exact_match         | topic          |
| 4 | images     | categories   | categories         | categories    | question                   | ground_truth_chunks | retrieved_chunks    | level          |
| 5 |            | topic        | topic              | topic         | answer                     | precision_at_k      | retrieved_context   | feedback       |
| 6 |            | images       | images             | images        | ground_truth_chunks        | recall_at_k         | ground_truth_chunks | sources        |
| 7 |            | summary      | summary            | summary       |                            | reciprocal_rank     | precision_at_k      |                |
| 8 |            | chunk_index  | chunk_index        | chunk_index   |                            | ndcg_at_k           | recall_at_k         |                |
| 9 |            | char_start   | char_start         | char_start    |                            | topic               | reciprocal_rank     |                |
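
Reading the table column by column works for a quick look, but differences between two schemas are easier to spot as set differences. The sketch below operates on a dict shaped like the notebook's `summary` (file name mapped to a list of field names); `field_diff` is a hypothetical helper.

```python
def field_diff(summary, left, right):
    """Compare the field lists of two entries in a {name: [fields]} dict."""
    left_fields = set(summary.get(left, []))
    right_fields = set(summary.get(right, []))
    return {
        "only_left": sorted(left_fields - right_fields),
        "only_right": sorted(right_fields - left_fields),
        "shared": sorted(left_fields & right_fields),
    }
```

For example, `field_diff(summary, "chunks.jsonl", "qdrant points")` would list any fields that were added or dropped between the processed chunks and the Qdrant payloads.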
