# **IIT Kharagpur Data Science Hackathon (KDSH) 2026 – Track A**

---
> **Task:** Global Narrative Consistency Reasoning over Long-Form Texts.

This notebook implements an end-to-end pipeline to determine whether a hypothetical character backstory is causally and logically consistent with a full-length narrative. The system focuses on long-context handling, evidence aggregation, and rule-based consistency judgment rather than text generation.



**Data Loading**

---
Mount Google Drive, where the dataset provided by the hackathon organizers is stored.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Environment Setup**

---

Install required Python libraries for the pipeline.


In [2]:
!pip install pathway sentence-transformers pandas numpy tqdm

Collecting pathway
  Downloading pathway-0.28.0-cp310-abi3-manylinux_2_24_x86_64.whl.metadata (61 kB)
Collecting h3>=4 (from pathway)
  Downloading h3-4.4.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (18 kB)
Collecting python-sat>=0.1.8.dev0 (from pathway)
  Downloading python_sat-1.8.dev26-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Collecting beartype<0.16.0,>=0.14.0 (from pathway)
  Downloading beartype-0.15.0-py3-none-any.whl.metadata (28 kB)
Collecting diskcache>=5.2.1 (from pathway)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting boto3<2.0.0,>=1.26.76 (from pathway)
  Downloading boto3-1.42.25-py3-none-any.whl.metadata (6.8 kB)
Collecting jupyt

**Load Dataset**

---
Define dataset paths and load training and test CSV files.


In [1]:
import pandas as pd
import os

BASE_PATH = "/content/drive/MyDrive/Dataset"
BOOK_PATH = f"{BASE_PATH}/Books"  # folder the novel .txt files are read from

train_df = pd.read_csv(f"{BASE_PATH}/train.csv")
test_df  = pd.read_csv(f"{BASE_PATH}/test.csv")

print("Train columns:", train_df.columns)
print("Test columns:", test_df.columns)


Train columns: Index(['id', 'book_name', 'char', 'caption', 'content', 'label'], dtype='object')
Test columns: Index(['id', 'book_name', 'char', 'caption', 'content'], dtype='object')


**Load Full Narratives**

---
Load the complete novel texts used for long-context analysis.


In [2]:
novels = {}

with open("/content/drive/MyDrive/Dataset/Books/In search of the castaways.txt",
          "r", encoding="utf-8") as f:
    novels[1] = f.read()

with open("/content/drive/MyDrive/Dataset/Books/The Count of Monte Cristo.txt",
          "r", encoding="utf-8") as f:
    novels[2] = f.read()

print("Loaded novels:", novels.keys())


Loaded novels: dict_keys([1, 2])


**Chunk Long Narratives**

---
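Split each full novel into overlapping 1,200-character chunks. Consecutive chunks share 200 characters, so a passage that straddles a chunk boundary still appears intact in at least one chunk.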


In [3]:
def chunk_text(text, chunk_size=1200, overlap=200):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+chunk_size])
        i += chunk_size - overlap
    return chunks

novel_chunks = {}
for sid, text in novels.items():
    novel_chunks[sid] = chunk_text(text)

print("Chunks created for stories")


Chunks created for stories
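
A quick sanity check of the chunking arithmetic can help before running it on a full novel. The sketch below repeats the same character-window logic on a toy string (the sample text is an assumption for illustration, not dataset content) and confirms that the number of chunks equals the text length divided by the step size, rounded up.

```python
import math

def chunk_text(text, chunk_size=1200, overlap=200):
    # Same character-window logic as the cell above, repeated so the check runs on its own
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        i += chunk_size - overlap
    return chunks

# Toy text (an assumption for illustration): the numbers 0..999 written back to back
sample = "".join(str(n) for n in range(1000))
chunks = chunk_text(sample)

# The window advances by chunk_size - overlap characters each step
step = 1200 - 200
expected_count = math.ceil(len(sample) / step)

# For this sample, each chunk begins with the last 200 characters of the previous one
boundaries_shared = all(
    chunks[n][-200:] == chunks[n + 1][:200]
    for n in range(len(chunks) - 1)
)

print(len(chunks), expected_count, boundaries_shared)
```

Because the windows are measured in characters, a chunk can start or end in the middle of a word; splitting on sentence or paragraph boundaries would be a possible alternative.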


**Pathway Ingestion**

---
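Ingest chunked narrative data into a structured Pathway table. The later retrieval cells read from the in-memory novel_chunks dictionary; chunk_table is not referenced again in this notebook.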


In [4]:
import pathway as pw
import pandas as pd
import os

# 1️⃣ Write chunks to a temporary CSV
tmp_csv_path = "/content/chunks_for_pathway.csv"

rows = []
for sid, chunks in novel_chunks.items():
    for ch in chunks:
        rows.append({
            "story_id": sid,
            "text": ch
        })

df_chunks = pd.DataFrame(rows)
df_chunks.to_csv(tmp_csv_path, index=False)

print("Temporary CSV created:", tmp_csv_path)

# 2️⃣ Define Pathway schema
class ChunkSchema(pw.Schema):
    story_id: int
    text: str

# 3️⃣ Read the CSV with Pathway's filesystem connector
chunk_table = pw.io.fs.read(
    tmp_csv_path,
    format="csv",
    schema=ChunkSchema
)

print("✅ Pathway table created successfully via fs.read")


Temporary CSV created: /content/chunks_for_pathway.csv
✅ Pathway table created successfully via fs.read


**Generate Semantic Embeddings**

---
Compute vector representations for narrative chunks to enable retrieval.


In [5]:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunk_embeddings = {}
for sid, chunks in novel_chunks.items():
    chunk_embeddings[sid] = model.encode(chunks)

print("✅ Embeddings generated")


✅ Embeddings generated


**Retrieve Relevant Evidence**

---
Select the most relevant narrative chunks for a given backstory.


In [6]:
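def retrieve_chunks(story_id, backstory, k=5):
    """Return the k chunks of a novel most similar to the backstory, best match first."""
    chunks = novel_chunks[story_id]
    emb = chunk_embeddings[story_id]
    q = model.encode(backstory)
    # Dot product of every chunk embedding with the query embedding
    scores = np.dot(emb, q)
    # argsort is ascending, so reverse it and keep the k highest scores
    top_k = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top_k]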
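
Ranking by a raw dot product equals ranking by cosine similarity only when the vectors have unit length. The following is a minimal sketch, assuming the embeddings may not be normalized; `top_k_cosine` and the toy vectors are hypothetical names and data, not part of the notebook. It normalizes explicitly, which is a no-op if the embeddings are already unit length.

```python
import numpy as np

def top_k_cosine(chunk_vecs, query_vec, k=5):
    """Return (indices, scores) of the k rows most similar to the query, best first."""
    chunk_vecs = np.asarray(chunk_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # L2-normalize both sides so the dot product equals cosine similarity
    # (assumes no all-zero vectors)
    chunk_unit = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = chunk_unit @ query_unit
    # argsort is ascending, so reverse it before taking the first k
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Toy 2-D "embeddings" (an assumption for illustration)
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.1])

idx, sims = top_k_cosine(toy_vecs, toy_query, k=2)
raw_best = int(np.argmax(toy_vecs @ toy_query))  # winner under the raw dot product
print(idx, sims, raw_best)
```

In this toy case the raw dot product would rank the long vector at index 2 first, while cosine similarity prefers index 0, which points in nearly the same direction as the query.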


**Consistency Judgment**

---
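Apply a keyword-count rule over the retrieved excerpts and the backstory: each positive cue word found adds 1 to the score, each negative cue word subtracts 1, and the function returns 1 when the total is non-negative and 0 otherwise.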


In [7]:
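def judge(backstory, excerpts):
    """Return 1 if positive cue words are at least as frequent as negative ones, else 0."""
    # Combine the retrieved excerpts and the backstory into one lowercase string
    text = (" ".join(excerpts) + " " + backstory).lower()

    score = 0

    positive = ["believe", "learned", "decided", "promised", "trained"]
    negative = ["betrayed", "refused", "denied", "abandoned"]

    # Each cue counts once, and matching is by substring (e.g. "trained" also hits "untrained")
    for p in positive:
        if p in text:
            score += 1

    for n in negative:
        if n in text:
            score -= 1

    # Simple balance rule: a tie (score == 0) is classified as 1
    return 1 if score >= 0 else 0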
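
Substring matching counts a cue word even when it sits inside a longer word, so "untrained" would register as "trained". One possible refinement, sketched here under the assumption that whole-word matches are what the rule intends, uses regular-expression word boundaries. `keyword_score` is a hypothetical helper, not part of the notebook; the cue lists are copied from the cell above.

```python
import re

POSITIVE = ["believe", "learned", "decided", "promised", "trained"]
NEGATIVE = ["betrayed", "refused", "denied", "abandoned"]

def keyword_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Add 1 per positive cue and subtract 1 per negative cue found as a whole word."""
    lowered = text.lower()

    def present(word):
        # \b anchors the match at word boundaries, so "untrained" does not count as "trained"
        return re.search(rf"\b{re.escape(word)}\b", lowered) is not None

    return sum(present(w) for w in positive) - sum(present(w) for w in negative)

# Toy sentences (assumptions for illustration)
print(keyword_score("He was untrained and could not believe it."))
print(keyword_score("She refused, then denied everything."))
```

Whole-word matching also skips inflected forms such as "believed", so the cue lists would need to list those variants explicitly if they should count.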


**Inspect Test Schema**

---
Check column names in the test dataset to ensure correct mapping.


In [8]:
print(test_df.columns)


Index(['id', 'book_name', 'char', 'caption', 'content'], dtype='object')


**Generate Final Predictions**

---
Produce the binary consistency predictions for the test set and save them to results.csv.


In [9]:
# 0️⃣ Mapping (lowercase keys)
book_to_id = {
    "in search of the castaways": 1,
    "the count of monte cristo": 2
}

# 1️⃣ Generate predictions
preds = []

for _, row in test_df.iterrows():
    book_lower = row["book_name"].strip().lower()  # normalize
    story_id = book_to_id[book_lower]             # map to numeric ID
    backstory = row["content"]

    chunks = retrieve_chunks(story_id, backstory)
    pred = judge(backstory, chunks)
    preds.append(pred)

# 2️⃣ Save results
test_df["prediction"] = preds
test_df[["prediction"]].to_csv("results.csv", index=False)

print("✅ results.csv generated successfully")


✅ results.csv generated successfully


**Final Submission Output**

---
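Recompute the predictions and overwrite results.csv with the two columns used for the Track A submission: story_id (taken from the test set's id column) and prediction.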


In [10]:
preds = []

for _, row in test_df.iterrows():
    book_lower = row["book_name"].strip().lower()
    story_num = 1 if "castaways" in book_lower else 2

    backstory = row["content"]
    chunks = retrieve_chunks(story_num, backstory)

    pred = judge(backstory, chunks)
    preds.append(pred)

results_df = pd.DataFrame({
    "story_id": test_df["id"],
    "prediction": preds
})

results_df.to_csv("results.csv", index=False)
print("✅ Final Track A results.csv generated")


✅ Final Track A results.csv generated
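
Before downloading, it can be worth checking that the results frame has the shape the cell above intends. The sketch below is illustrative only: `check_submission` is a hypothetical helper, the toy frame stands in for `results_df`, and the column names and the 0/1 label set are taken from this notebook rather than from an official format specification.

```python
import pandas as pd

def check_submission(df, n_expected):
    """Return a list of problems found in a results frame (an empty list means none found)."""
    problems = []
    if list(df.columns) != ["story_id", "prediction"]:
        # Without the expected columns the remaining checks cannot run
        return [f"unexpected columns: {list(df.columns)}"]
    if len(df) != n_expected:
        problems.append(f"expected {n_expected} rows, found {len(df)}")
    if df["story_id"].duplicated().any():
        problems.append("duplicate story_id values")
    if not df["prediction"].isin([0, 1]).all():
        problems.append("prediction contains values other than 0 and 1")
    return problems

# Toy frames (assumptions for illustration, not real predictions)
good = pd.DataFrame({"story_id": [10, 11, 12], "prediction": [1, 0, 1]})
bad = good.assign(prediction=[1, 2, 0])

print(check_submission(good, n_expected=3))
print(check_submission(bad, n_expected=3))
```

In the notebook, the equivalent call would be `check_submission(results_df, len(test_df))`.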


**Download Results**

---
Download the generated results.csv file for submission.


In [11]:
from google.colab import files
files.download("results.csv")

