# **NTLRAG** - **N**arrative **T**opic **L**abels derived with **R**etrieval **A**ugmented **G**eneration

An orchestrated RAG framework for narrative extraction from topic model output.

This implementation of NTLRAG uses:

*   llama3.2 from Ollama for all LLM tasks
*   ChromaDB and BM25 retrievers
*   LangChain and LangGraph for orchestration
*   Pydantic for data structures

The goal of NTLRAG is to extract narratives from document clusters. It is independent of the model used to create those clusters, and alternative LLMs, retrievers, and orchestration frameworks can be substituted.

To reproduce results for the sample dataset with this implementation, run the code below sequentially without adjustment.

### Setup and Dependencies

In [None]:
# use setup file to install system packages, ollama installer (change if necessary) and python libraries
!git clone https://github.com/lisagrobels/NTLRAG.git
%cd NTLRAG
!bash setup.sh

In [2]:
# add system path
import sys
sys.path.append("/content/NTLRAG")

In [3]:
# import libraries (most libraries are imported in the dedicated .py files in rag_pipeline folder)

import pandas as pd
import numpy as np
import json
import threading
import subprocess
import time
from enum import Enum
from pathlib import Path

from IPython.display import Markdown

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_community.document_loaders import JSONLoader
from langchain.schema import Document

In [4]:
# start ollama server
def run_ollama_serve():
  subprocess.Popen(["ollama","serve"])
thread = threading.Thread(target=run_ollama_serve)

thread.start()
time.sleep(5)
!ollama pull llama3.2
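
The fixed `time.sleep(5)` above assumes the server is ready within five seconds. As a hedged alternative, the sketch below polls a TCP port until it accepts connections. The helper name `wait_for_port` is hypothetical (it is not part of NTLRAG or Ollama), and the usage comment assumes Ollama listens on its default port 11434.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of NTLRAG or Ollama.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not reachable yet: wait briefly and try again.
            time.sleep(interval)
    return False


# Usage (assumes the default Ollama port 11434):
# if not wait_for_port("127.0.0.1", 11434):
#     raise RuntimeError("Ollama server did not start in time")
```

A successful connection only shows that the port is open; it does not confirm that the model has finished downloading.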


### Setup Paths and Retrievers

We use a cleaned subsample (1,000 messages) of the X dataset available at https://github.com/sinking8/x-24-us-election (Publication: https://arxiv.org/abs/2411.00376). The dataset is shared under CC BY-SA 4.0.

For NTLRAG, three input files are needed:


1.   a .csv file with at least two columns: Document (the text of the short message) and Topic (the number of the topic assigned to the document),
2.   a .json file with topic keywords (see sample file), and
3.   a .json file with news data (see sample file).
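
Before running the pipeline on your own data, it may help to check that the .csv file has the two required columns. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and it reads the header with the standard-library `csv` module instead of pandas so that it runs without extra dependencies.

```python
import csv
import io

# Column names required by the list above.
REQUIRED_COLUMNS = ("Document", "Topic")


def missing_columns(csv_text: str, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper for illustration; not part of the NTLRAG repository.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    return [name for name in required if name not in header]


# Usage with a file on disk (path is a placeholder):
# with open("my_topics.csv", newline="", encoding="utf-8") as f:
#     problems = missing_columns(f.read())
# if problems:
#     raise ValueError(f"CSV is missing columns: {problems}")
```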



In [8]:
# all repo paths
REPO_PATH = Path("/content/NTLRAG")
CSV_PATH = REPO_PATH / "data" / "testdata_seedtopics.csv" # csv file with document text and topic number
KEYW_PATH = REPO_PATH / "data" / "testdata_topic_keywords.json" # json file with topic keywords
NEWS_PATH = REPO_PATH / "data" / "testdata_news.json" # json file with news data
output_dir = REPO_PATH / "results" # output directory for the narratives, adjust to yours

# Add repo root to python path for imports
sys.path.append(str(REPO_PATH))


In [9]:
# load topic model output file
df = pd.read_csv(CSV_PATH)

# load topic keywords
with open(KEYW_PATH, "r") as f:
    topic_keywords = json.load(f)

In [10]:
# chroma uses ollama embeddings, adjust LLM if needed
embedding_model = OllamaEmbeddings(
    model="llama3.2",
)

In [None]:
# assign ollamaEmbeddings to chroma builder
chroma = Chroma(
    embedding_function=embedding_model
)

In [12]:
# load news documents from repository
from rag_pipeline.utils import load_json_documents
docs = load_json_documents(NEWS_PATH, content_key="description")

In [None]:
# add news documents to chroma vector storage
chroma.add_documents(docs)

In [14]:
# build chroma retriever
chroma_retriever = chroma.as_retriever(search_kwargs={"k": 5})  # retrieve top 5 docs, adjust number if needed
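
To make the `k` setting concrete, the sketch below shows the general idea of top-k retrieval over embedding vectors in plain Python. It is an illustration under assumptions, not Chroma's implementation: the function names are hypothetical, the vectors are toy values, and cosine similarity is chosen for readability (the distance metric Chroma actually uses depends on how the collection is configured).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy example: three 2-dimensional "embeddings" and one query vector.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```

A larger `k` passes more news context to the LLM at the cost of longer prompts.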

In [15]:
# build bm25 retriever dict
bm25_retrievers = {}

# cast topic IDs to strings so they match the keys of the topic keywords JSON (JSON object keys are always strings)
df['Topic'] = df['Topic'].astype(str)

# loop over each topic to create a BM25 retriever
for topic_id in df['Topic'].unique():
    topic_docs = df[df['Topic'] == topic_id]

    # convert to LangChain Document objects
    documents = [
        Document(
            page_content=row['Document'],
            metadata={"topic": topic_id}
        )
        for _, row in topic_docs.iterrows()
    ]

    # build BM25Retriever for this topic
    retriever = BM25Retriever.from_documents(documents)
    retriever.k = 10

    # add to dictionary
    bm25_retrievers[topic_id] = retriever
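
For readers unfamiliar with BM25, the sketch below scores a handful of documents against a query using a textbook Okapi BM25 formula. It is a simplified illustration, not the code path used by `BM25Retriever`: the function name, whitespace tokenization, IDF variant, and parameter defaults (`k1=1.5`, `b=0.75`) are assumptions chosen for clarity.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook Okapi BM25 formula.

    Illustrative only: lowercased whitespace tokenization and a non-negative
    IDF variant; library implementations may differ in both.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = ["biden drops out of the race", "trump rally in ohio", "the race is on"]
print(bm25_scores("biden race", docs))
```

Because each retriever above is built from the documents of a single topic, the term statistics (document frequency, average length) are computed per topic, not over the whole dataset.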

In [16]:
# load functions for RAG pipeline
from rag_pipeline.utils import load_json_documents
import rag_pipeline.pipeline_functions as pfct
from rag_pipeline.pipeline_functions import GraphState, Narrative
from rag_pipeline.pipeline_functions import run_narrative_extraction

### Run NTLRAG

In [17]:
# Run narrative extraction
all_approved_narratives, topic_results = run_narrative_extraction(
    topic_keywords=topic_keywords,
    bm25_retrievers=bm25_retrievers,
    chroma_retriever=chroma_retriever,
    output_dir=output_dir
)

# run_narrative_extraction logs intermediate LLM output for each topic of the sample dataset; the line below prints the number of approved narratives
print(f"Total approved narratives: {len(all_approved_narratives)}")


🚀 Processing topic 0...
🔄 Extract attempt 1 for topic 0...
RAW LLM result: topic_id='' actor='user' action='opine/express negative opinion' event='Joe Biden dropping out of the race, Trump x Biden edits to Chappell songs expiring' description="Some users express disappointment and frustration with Joe Biden's decision to drop out of the race"
Narrative before overwrite: topic_id='' actor='user' action='opine/express negative opinion' event='Joe Biden dropping out of the race, Trump x Biden edits to Chappell songs expiring' description="Some users express disappointment and frustration with Joe Biden's decision to drop out of the race"
Narrative after overwrite: topic_id='0' actor='user' action='opine/express negative opinion' event='Joe Biden dropping out of the race, Trump x Biden edits to Chappell songs expiring' description="Some users express disappointment and frustration with Joe Biden's decision to drop out of the race"
📝 Narrative to grade (may be partial): topic_id='0' actor=
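
The log above shows a narrative with five fields (`topic_id`, `actor`, `action`, `event`, `description`) whose empty `topic_id` is overwritten with the ID of the topic being processed. The sketch below mirrors that step with a standard-library dataclass. It is a stand-in under assumptions: the repository's `Narrative` is a Pydantic model whose definition is not shown here, and the names `NarrativeSketch` and `with_topic_id` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NarrativeSketch:
    """Stand-in for the fields visible in the log; not the repository's class."""
    topic_id: str
    actor: str
    action: str
    event: str
    description: str


def with_topic_id(narrative: NarrativeSketch, topic_id) -> NarrativeSketch:
    """Return a copy whose topic_id is set by the pipeline, not by the LLM."""
    return replace(narrative, topic_id=str(topic_id))


raw = NarrativeSketch(
    topic_id="",  # the LLM left this field empty in the log above
    actor="user",
    action="opine/express negative opinion",
    event="Joe Biden dropping out of the race",
    description="Some users express disappointment and frustration",
)
fixed = with_topic_id(raw, 0)
print(fixed.topic_id)
```

Setting the ID in code keeps it reliable even when the model omits or invents a value for that field.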

In [None]:
# run this in case ollama disconnects
def run_ollama_serve():
  subprocess.Popen(["ollama","serve"])
thread = threading.Thread(target=run_ollama_serve)

thread.start()
time.sleep(5)

!ollama pull llama3.2

