# **NTLRAG** - **N**arrative **T**opic **L**abels derived with **R**etrieval **A**ugmented **G**eneration

An orchestrated RAG framework for narrative extraction from topic model output.

This implementation of NTLRAG uses:

*   llama3.2 from Ollama for all LLM tasks
*   ChromaDB and BM25 retrievers
*   LangChain and LangGraph for orchestration
*   Pydantic for data structures

The goal of NTLRAG is to extract narratives from document clusters. It is independent of the model used to create those clusters, and alternative LLMs, retrievers, and orchestration frameworks can be substituted.

### Setup and Dependencies

In [None]:
# use setup file to install system packages, ollama installer (change if necessary) and python libraries
!git clone https://github.com/lisagrobels/NTLRAG.git
%cd NTLRAG
!bash setup.sh


In [None]:
import sys
sys.path.append("/content/NTLRAG")

In [None]:
# import libraries

# general libraries
import pandas as pd
import numpy as np
import json
import threading
import subprocess
import time
from enum import Enum
from pathlib import Path

from IPython.display import Markdown

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_community.document_loaders import JSONLoader
from langchain.schema import Document

In [None]:
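# start the Ollama server in the background
def run_ollama_serve():
    # Popen returns immediately; the server keeps running as a child process
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to come up before pulling the model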
!ollama pull llama3.2
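
A fixed `time.sleep(5)` can be too short on a slow machine. As a hedged alternative, the sketch below polls the server until it answers. It assumes Ollama listens on its default address `http://127.0.0.1:11434`; the helper name `wait_for_server` and its parameters are hypothetical and not part of the NTLRAG repository.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://127.0.0.1:11434",
                    timeout: float = 30.0,
                    interval: float = 0.5) -> bool:
    """Poll `url` until it answers or `timeout` seconds have passed.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    # Bypass any proxy settings: the server is expected to run locally.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet (for example, connection refused); retry shortly.
            time.sleep(interval)
    return False
```

Calling `wait_for_server()` after starting the server and before `ollama pull` would replace the fixed delay; a `False` result signals that the server did not come up in time.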


### Setup Paths and Retrievers

We use a cleaned subsample (1,000 messages) of the X dataset at https://github.com/sinking8/x-24-us-election (publication: https://arxiv.org/abs/2411.00376), shared under CC BY-SA 4.0.

NTLRAG needs three input files:


1.   a .csv file with at least two columns: `Document` (the text of the short message) and `Topic` (the number of the topic assigned to the document),
2.   a .json file with topic keywords (see sample file), and
3.   a .json file with news data (see sample file).
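
Before running the pipeline on your own data, it can help to confirm that the CSV header contains the two required columns. The following is a minimal sketch using only the standard library; the helper `missing_columns` and the sample rows are made up for illustration and are not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = {"Document", "Topic"}


def missing_columns(csv_file) -> set:
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return REQUIRED_COLUMNS - set(header)


# Tiny in-memory stand-in for the real CSV (made-up rows).
sample = io.StringIO("Document,Topic\nfirst short message,0\nsecond short message,1\n")
print(missing_columns(sample))  # an empty set means the header is complete
```

For a file on disk, the same function can be called as `missing_columns(open(path, newline=""))`.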



In [None]:
REPO_PATH = Path("/content/NTLRAG")  # directory created by the git clone above
CSV_PATH = REPO_PATH / "data" / "testdata_seedtopics.csv" # csv file with document text and topic number
KEYW_PATH = REPO_PATH / "data" / "testdata_topic_keywords.json" # json file with topic keywords
NEWS_PATH = REPO_PATH / "data" / "testdata_news.json" # json file with news data
output_dir = REPO_PATH / "results" # output directory for the narratives, choose yours

# Add repo root to Python path for imports
sys.path.append(str(REPO_PATH))


In [None]:
# load topic model output file
df = pd.read_csv(CSV_PATH)

# load topic keywords
with open(KEYW_PATH, "r") as f:
    topic_keywords = json.load(f)
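
One detail matters when matching topic numbers from the CSV against the keyword file: JSON object keys are always strings, so an integer topic ID will not match after loading. The sketch below demonstrates this with a hypothetical keyword mapping; the actual structure of the keyword file may differ (see the sample file).

```python
import json

# Hypothetical keyword mapping with integer topic IDs, as a topic model might emit it.
keywords_by_topic = {0: ["biden", "race"], 1: ["vote", "ballot"]}

# Round-trip through JSON, as happens when the mapping is saved and reloaded.
reloaded = json.loads(json.dumps(keywords_by_topic))

print(list(reloaded))      # ['0', '1']
print(0 in reloaded)       # False
print(str(0) in reloaded)  # True
```

This is why topic IDs are best compared as strings on both sides.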

In [None]:
# chroma uses ollama embeddings, adjust if needed
embedding_model = OllamaEmbeddings(
    model="llama3.2",
)

In [None]:
chroma = Chroma(
    embedding_function=embedding_model
)


In [None]:
# Load news documents
from rag_pipeline.utils import load_json_documents
docs = load_json_documents(NEWS_PATH, content_key="description")

In [None]:
# add news documents to chroma vector storage
chroma.add_documents(docs)


In [None]:
# Build retriever
chroma_retriever = chroma.as_retriever(search_kwargs={"k": 5})  # retrieve top 5 docs, adjust if needed

In [None]:
# Build bm25 retriever dict
bm25_retrievers = {}

# Cast topic values to strings so they match the JSON keys (JSON object keys are always strings)
df['Topic'] = df['Topic'].astype(str)

# Loop over each topic to create a BM25 retriever
for topic_id in df['Topic'].unique():
    topic_docs = df[df['Topic'] == topic_id]

    # Convert to LangChain Document objects
    documents = [
        Document(
            page_content=row['Document'],
            metadata={"topic": topic_id}
        )
        for _, row in topic_docs.iterrows()
    ]

    # Build BM25Retriever for this topic
    retriever = BM25Retriever.from_documents(documents)
    retriever.k = 10

    # Add to dictionary
    bm25_retrievers[topic_id] = retriever
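
To give an intuition for what a BM25 retriever ranks by, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the implementation behind `BM25Retriever`: the whitespace tokenizer, the non-negative IDF variant, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration, and the example documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight; the +1 keeps the IDF non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation: long documents need more hits for the same score.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: str, docs: list[str], k: int = 10) -> list[str]:
    """Return the k highest-scoring documents, best first."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]


example_docs = [
    "biden drops out of the race",
    "ballot counting continues",
    "the race for the senate",
]
print(top_k("biden race", example_docs, k=2))
```

Because each topic gets its own retriever, term statistics such as document frequency are computed within that topic's documents only.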

In [None]:
# load functions for RAG pipeline
import rag_pipeline.pipeline_functions as pfct
from rag_pipeline.pipeline_functions import GraphState, Narrative
from rag_pipeline.pipeline_functions import run_narrative_extraction

### Run NTLRAG

In [None]:
# Run narrative extraction
all_approved_narratives, topic_results = run_narrative_extraction(
    topic_keywords=topic_keywords,
    bm25_retrievers=bm25_retrievers,
    chroma_retriever=chroma_retriever,
    output_dir=output_dir
)
print(f"Total approved narratives: {len(all_approved_narratives)}")


🚀 Processing topic 0...
🔄 Extract attempt 1 for topic 0...
RAW LLM result: topic_id='Oh Joe' actor='users' action='express opinions and complaints about' event="Joe Biden's decision to drop out of the race, his age and senility" description="Users express mixed feelings towards Joe Biden's decision to drop out of the race"
Narrative before overwrite: topic_id='Oh Joe' actor='users' action='express opinions and complaints about' event="Joe Biden's decision to drop out of the race, his age and senility" description="Users express mixed feelings towards Joe Biden's decision to drop out of the race"
Narrative after overwrite: topic_id='0' actor='users' action='express opinions and complaints about' event="Joe Biden's decision to drop out of the race, his age and senility" description="Users express mixed feelings towards Joe Biden's decision to drop out of the race"
📝 Narrative to grade (may be partial): topic_id='0' actor='users' action='express opinions and complaints about' event="Joe 
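
The log shows that a narrative has five fields (`topic_id`, `actor`, `action`, `event`, `description`) and that the `topic_id` proposed by the LLM is overwritten with the known topic number. The sketch below mirrors that step with a standard-library dataclass as a stand-in for the pipeline's Pydantic model; the helper `with_true_topic_id` is hypothetical, and the repository's own `Narrative` class may be defined differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Narrative:
    """Stdlib stand-in for the pipeline's Pydantic model; fields taken from the log."""
    topic_id: str
    actor: str
    action: str
    event: str
    description: str


def with_true_topic_id(narrative: Narrative, topic_id) -> Narrative:
    """Return a copy whose topic_id is the known topic, not the LLM's guess."""
    return replace(narrative, topic_id=str(topic_id))


# Values copied from the log output for topic 0.
raw = Narrative(
    topic_id="Oh Joe",
    actor="users",
    action="express opinions and complaints about",
    event="Joe Biden's decision to drop out of the race, his age and senility",
    description="Users express mixed feelings towards Joe Biden's decision to drop out of the race",
)
fixed = with_true_topic_id(raw, 0)
print(fixed.topic_id)  # 0
```

Overwriting the field after generation keeps results keyed by the real topic even when the model invents a label such as `'Oh Joe'`.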

In [None]:
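# in case Ollama disconnects: restart the server, then re-run the extraction cell
def run_ollama_serve():
    # Popen returns immediately; the server keeps running as a child process
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to come up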

!ollama pull llama3.2

