Narrative Retrieval + RAG Pipeline

This notebook/script powers the **Narrative RAG (Retrieval-Augmented Generation)** route for the Fassistant. It enables Eagles-specific storytelling by retrieving and summarizing relevant context from chunked Wikipedia-style narratives (e.g., season recaps, playoff stories) using vector search and OpenAI.

---

## 📚 Overview

This module performs the following steps:

1. **Chunking**  
   Splits long narrative `.txt` files into overlapping text chunks with metadata (title, chunk ID).

2. **Embedding**  
   Encodes each chunk using `sentence-transformers` for semantic vector representation.

3. **Indexing**  
   Stores chunk vectors in a FAISS index for fast similarity search. Saves metadata in parallel.

4. **Retrieval**  
   Retrieves top-k semantically similar chunks for a given question.  
   If a year (e.g., 2021) is detected in the question, results from that year are prioritized.

5. **LLM Summarization**  
   Combines the retrieved chunks into a prompt and sends it to OpenAI
   for grounded narrative answers. The system prompt instructs the model to answer only from the retrieved context, which limits (but cannot fully rule out) hallucination.
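
As a rough illustration of step 5, the sketch below assembles chat messages from retrieved chunks without calling any API. The function name `build_messages`, the sample chunks, and the exact prompt wording are hypothetical; they are assumptions modeled on the summarization cell later in this notebook, not a required format.

```python
def build_messages(question, retrieved_chunks):
    """Hypothetical helper: turn retrieved chunks into chat messages."""
    # Join chunk texts with blank lines so each passage stays visually separate
    context = "\n\n".join(chunk["text"] for chunk in retrieved_chunks)
    system_prompt = (
        "Use only the context provided. "
        "Do not include outside information."
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


# Made-up chunks, for illustration only
sample_chunks = [
    {"chunk_id": "doc_a_001", "doc_title": "doc_a", "text": "First passage."},
    {"chunk_id": "doc_b_001", "doc_title": "doc_b", "text": "Second passage."},
]
messages = build_messages("What happened?", sample_chunks)
print(messages[1]["content"])
```

Keeping the grounding instruction in the system message and the context in the user message mirrors the split used in the notebook, but other layouts would likely work as well.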

---

## 📥 Inputs

- `.txt` narrative files in `data/narratives/`  
- `.env` file with:
  - `OPENAI_API_KEY`
  - `REQUESTS_CA_BUNDLE` (for corporate SSL certs, if needed)

---

## 📤 Outputs

- `data/chunks.jsonl`: Chunked text and metadata  
- `data/narrative_index.faiss`: FAISS vector index  
- `data/narrative_metadata.json`: Metadata for chunk display (title, text)

---

## 🧠 Example Use Case

> “Which was more disappointing, the 2020 or 2023 Eagles season?”  
> → Retrieves relevant recaps and feeds them to the LLM to generate a grounded comparison.

Integrated into the question router alongside the **Stats Text-to-SQL** route.

---

In [None]:
import os
from dotenv import load_dotenv

# Set the working directory to the parent directory
os.chdir("..")

# Load variables from the .env file
load_dotenv()

# Point HTTPS libraries at the corporate CA bundle, if one is configured.
# os.environ only accepts strings, so skip this step when the variable is unset.
ca_bundle = os.getenv("REQUESTS_CA_BUNDLE")
if ca_bundle:
    os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle
    os.environ["CURL_CA_BUNDLE"] = ca_bundle
    os.environ["SSL_CERT_FILE"] = ca_bundle

from paths import CHUNKS_JSONL_PATH, FAISS_INDEX_PATH, METADATA_JSON_PATH, NARRATIVE_DIR

#print("Chunks JSONL Path:", CHUNKS_JSONL_PATH)


Chunks JSONL Path: c:\Users\250331\Documents\GenAI Projects\EaglesGPT\data\chunks.jsonl


Note: for the question "Which season was more disappointing, 2020 or 2023?", plain similarity search did not return the 2023 season recap; it returned a Nick Sirianni chunk instead. To compensate, retrieval fetches `top_k * 3` candidates and re-ranks them so that any chunk with a year from the question in its title is prioritized.
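
A minimal sketch of this year-based re-ranking, assuming the candidates arrive ordered by similarity (best first). The helper name `boost_by_year` and the two non-season document titles are hypothetical placeholders; only the two season titles appear in this notebook's output.

```python
import re


def extract_years(text):
    # Four-digit years starting with 19 or 20, e.g. "2020"
    return re.findall(r"\b(?:19|20)\d{2}\b", text)


def boost_by_year(candidates, question, top_k):
    """Hypothetical helper: move chunks whose title contains a question year to the front."""
    years = set(extract_years(question))
    # sorted() is stable, so the similarity order is preserved within each group
    ranked = sorted(
        candidates,
        key=lambda c: 0 if any(y in c["doc_title"] for y in years) else 1,
    )
    return ranked[:top_k]


# Candidates in similarity order; the first and third titles are made up
candidates = [
    {"doc_title": "coach_profile"},
    {"doc_title": "2020_philadelphia_eagles_season"},
    {"doc_title": "franchise_history"},
    {"doc_title": "2023_philadelphia_eagles_season"},
]
question = "Which was more disappointing, the 2020 or 2023 Eagles season?"
top = boost_by_year(candidates, question, top_k=2)
print([c["doc_title"] for c in top])
```

Because the boost is a binary key rather than a weighted score, a year-matching chunk always outranks a non-matching one regardless of how close the two were in the similarity search. That is a deliberate simplification here.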

1. Load and chunk the Wikipedia narrative docs.
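
To make the overlap arithmetic concrete, here is a small sketch with toy sizes. The helper name `word_windows` and the ten placeholder words are hypothetical; the stepping logic is assumed to match the chunking function in the next cell (a new window starts every `chunk_size - overlap` words).

```python
def word_windows(words, chunk_size, overlap):
    """Hypothetical helper: overlapping windows over a list of words."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be greater than overlap")
    # Each window starts `step` words after the previous one
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]


words = [f"w{n}" for n in range(10)]  # placeholder words w0..w9
windows = word_windows(words, chunk_size=4, overlap=1)
for window in windows:
    print(window)
```

With 10 words, `chunk_size=4`, and `overlap=1`, the windows start at words 0, 3, 6, and 9. The last window holds only `w9`, which already appears at the end of the previous window, so very short trailing chunks are possible with this scheme.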

In [None]:
import os
import json

# -----------------------------------
# Step 1: Chunking Function
# -----------------------------------
def chunk_text_fixed(text, chunk_size=300, overlap=50):
    """
    Splits text into overlapping chunks.
    Each chunk has `chunk_size` words, with `overlap` words shared with the next chunk.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap

    if step <= 0:
        raise ValueError("chunk_size must be greater than overlap")

    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks

# -----------------------------------
# Step 2: Load .txt Files, Chunk, Print Examples, Save as JSONL
# -----------------------------------
def load_and_chunk_wikipedia(folder_path, output_path, chunk_size=300, overlap=50):
    all_chunks = []

    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            full_path = os.path.join(folder_path, filename)
            with open(full_path, 'r', encoding='utf-8') as f:
                text = f.read()
                doc_title = filename.replace('.txt', '')

                chunks = chunk_text_fixed(text, chunk_size=chunk_size, overlap=overlap)

                for i, chunk in enumerate(chunks):
                    chunk_id = f"{doc_title}_{str(i+1).zfill(3)}"
                    all_chunks.append({
                        'chunk_id': chunk_id,
                        'doc_title': doc_title,
                        'text': chunk
                    })

    # --- Print first 3 chunks ---
    print("\n--- Example Chunks (JSON) ---")
    for chunk in all_chunks[:3]:
        print(json.dumps(chunk, indent=2))

    # --- Save all chunks to JSONL ---
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, 'w', encoding='utf-8') as out_file:
        for chunk in all_chunks:
            out_file.write(json.dumps(chunk) + '\n')

    print(f"\n✅ Saved {len(all_chunks)} chunks to {output_path}")
    return all_chunks



# --- Run It ---
chunks = load_and_chunk_wikipedia(folder_path=NARRATIVE_DIR, output_path=CHUNKS_JSONL_PATH)



--- Example Chunks (JSON) ---
{
  "chunk_id": "2020_philadelphia_eagles_season_001",
  "doc_title": "2020_philadelphia_eagles_season",
  "text": "The 2020 season was the Philadelphia Eagles' 88th in the National Football League (NFL) and their fifth and final under head coach Doug Pederson. They failed to improve on their 9\u20137 record from the previous season following a 23\u201317 loss to the Seattle Seahawks in Week 12. They were eliminated from playoff contention for the first time since 2016 following a Week 16 loss to the Dallas Cowboys and finished with a dismal 4\u201311\u20131 record, the second-worst in the National Football Conference (NFC), and their worst since 2012. After starting 3\u20134\u20131 heading into their bye week and leading the NFC East, the Eagles would lose 7 of their last 8 games. Injuries and poor quarterback play were factors in their struggles in the season. On January 11, 2021, the Eagles announced head coach Doug Pederson would not return after the 

2. Embed the corpus and index
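
For intuition about the indexing step: FAISS's `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The pure-Python sketch below imitates that behavior on toy 2-D vectors. The function names and the vectors are hypothetical, and this is only an illustration of the idea, not how FAISS is implemented.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Hypothetical helper: brute-force nearest neighbours by squared L2 distance."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:k]  # list of (distance, index), closest first


# Toy "embeddings"; real sentence embeddings have hundreds of dimensions
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
for distance, idx in hits:
    print(idx, round(distance, 2))
```

The returned indices play the same role as the row numbers FAISS returns, which is why the metadata list must be saved in the same order as the vectors were added.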

In [None]:
#! pip install -U sentence-transformers
#! pip install faiss-cpu
#! pip install -U bitsandbytes
#! pip install --upgrade openai
#! pip install python-dotenv

import os
import json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from openai import OpenAI
from tqdm import tqdm
import httpx


# -----------------------------------
# 📄 Load Chunks
# -----------------------------------
def load_chunks_from_jsonl(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

# -----------------------------------
# 🧠 Embed Chunks
# -----------------------------------
def embed_chunks(chunks, model_name='all-MiniLM-L6-v2'):
    model = SentenceTransformer(model_name)
    texts = [chunk['text'] for chunk in chunks]
    print(f"🔁 Embedding {len(texts)} chunks...")
    embeddings = model.encode(texts, show_progress_bar=True, batch_size=32)
    for i, emb in enumerate(embeddings):
        chunks[i]['embedding'] = emb.tolist()
    return chunks

# -----------------------------------
# 🗃️ Build FAISS Index
# -----------------------------------
def build_faiss_index(embedded_chunks, index_path, metadata_path):
    embeddings = np.array([c['embedding'] for c in embedded_chunks]).astype('float32')
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    os.makedirs(os.path.dirname(index_path), exist_ok=True)
    faiss.write_index(index, index_path)
    print(f"✅ FAISS index saved to {index_path}")

    metadata = [{k: c[k] for k in ['chunk_id', 'doc_title', 'text']} for c in embedded_chunks]
    with open(metadata_path, 'w', encoding='utf-8') as f:
        json.dump(metadata, f, indent=2)
    print(f"✅ Metadata saved to {metadata_path}")



if __name__ == "__main__":
    # Step 1: Load chunks from JSONL
    chunks = load_chunks_from_jsonl(CHUNKS_JSONL_PATH)
    print(f"📦 Loaded {len(chunks)} chunks.")

    # Step 2: Embed the chunks
    embedded_chunks = embed_chunks(chunks)

    # Step 3: Build and save FAISS index + metadata
    build_faiss_index(embedded_chunks, FAISS_INDEX_PATH, METADATA_JSON_PATH)

📦 Loaded 160 chunks.
🔁 Embedding 160 chunks...


Batches: 100%|██████████| 5/5 [00:03<00:00,  1.51it/s]

✅ FAISS index saved to c:\Users\250331\Documents\GenAI Projects\EaglesGPT\data\narrative_index.faiss
✅ Metadata saved to c:\Users\250331\Documents\GenAI Projects\EaglesGPT\data\narrative_metadata.json





In [None]:
import os
import re
import faiss
import json
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import httpx

# -----------------------------------
# 🔑 Load API Key Securely
# -----------------------------------

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if not OPENAI_API_KEY:
    raise ValueError("❌ OPENAI_API_KEY environment variable not set.")

# Use the corporate CA bundle when configured; otherwise fall back to
# httpx's default certificate verification instead of passing None.
client = OpenAI(
    api_key=OPENAI_API_KEY,
    http_client=httpx.Client(verify=os.environ.get("REQUESTS_CA_BUNDLE") or True)
)


# -----------------------------------
# 🔍 Retrieve Chunks
# -----------------------------------
def extract_years(text):
    return re.findall(r'\b(?:19|20)\d{2}\b', text)

def retrieve_narrative_chunks(question, index_path, metadata_path, model_name='all-MiniLM-L6-v2', top_k=5):
    print(f"\n🔍 Retrieving top {top_k} chunks for question: {question}")

    # Extract years from the question to boost relevant chunks. Testing showed that pure semantic
    # similarity could miss the most relevant chunks, e.g., when the question contained "disappointing".
    target_years = set(extract_years(question))

    index = faiss.read_index(index_path)
    with open(metadata_path, 'r', encoding='utf-8') as f:
        metadata = json.load(f)

    model = SentenceTransformer(model_name)
    query_vector = model.encode([question]).astype('float32')
    
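    # Since we re-rank/boost afterwards, fetch more than top_k candidates
    distances, indices = index.search(query_vector, top_k * 3)
    # FAISS pads with -1 when the index holds fewer vectors than requested; skip those slots
    candidates = [metadata[i] for i in indices[0] if i != -1]

    # Score function that boosts chunks whose title contains a target year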
    def score(chunk):
        title = chunk['doc_title']
        return 0 if any(y in title for y in target_years) else 1

    candidates.sort(key=score)
    results = candidates[:top_k]

    print(f"\n📄 Retrieved Chunks (boosted by years: {', '.join(target_years) if target_years else 'None'}):\n")
    for i, r in enumerate(results):
        print(f"[{i+1}] ({r['chunk_id']}) from {r['doc_title']}")
        print(r['text'][:300] + "\n---\n")

    return results

# -----------------------------------
# 🤖 Answer with OpenAI
# -----------------------------------


def answer_with_openai(question, retrieved_chunks, model="gpt-3.5-turbo"):
    context = "\n\n".join(chunk["text"] for chunk in retrieved_chunks)

    system_prompt = (
        "You are a knowledgeable assistant helping summarize NFL team narratives. "
        "Use only the context provided. Do not hallucinate or include outside information."
    )

    user_prompt = f"""Answer the question based on the following Eagles season narratives:

{context}

Question: {question}
Answer:"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )

    answer = response.choices[0].message.content
    print("\n🤖 OpenAI's Answer:\n")
    print(answer)
    return answer

In [17]:


# Define the question
question = "Which was more disappointing, the 2020 or 2023 Eagles season?"

# Retrieve top relevant narrative chunks
retrieved_chunks = retrieve_narrative_chunks(
    question,
    index_path=FAISS_INDEX_PATH,
    metadata_path=METADATA_JSON_PATH
)

# Use OpenAI to generate the answer from retrieved chunks
answer = answer_with_openai(question, retrieved_chunks)

# Print the answer
print("\n🦅 Answer from OpenAI:\n")
print(answer)


🔍 Retrieving top 5 chunks for question: Which was more disappointing, the 2020 or 2023 Eagles season?

📄 Retrieved Chunks (boosted by years: 2023, 2020):

[1] (2020_philadelphia_eagles_season_001) from 2020_philadelphia_eagles_season
The 2020 season was the Philadelphia Eagles' 88th in the National Football League (NFL) and their fifth and final under head coach Doug Pederson. They failed to improve on their 9–7 record from the previous season following a 23–17 loss to the Seattle Seahawks in Week 12. They were eliminated from p
---

[2] (2023_philadelphia_eagles_season_001) from 2023_philadelphia_eagles_season
The 2023 season was the Philadelphia Eagles' 91st season in the National Football League (NFL), their 30th under the ownership of Jeffrey Lurie and their third under head coach Nick Sirianni. The Eagles entered the season as defending NFC champions. This season would mark the first season since 2010
---

[3] (2020_philadelphia_eagles_season_002) from 2020_philadelphia

In [19]:
from paths import FAISS_INDEX_PATH, METADATA_JSON_PATH
#from retrieve_narratives import retrieve_narrative_chunks  # assumes your original function is in this file

def get_relevant_narrative_chunks(question: str, top_k: int = 5) -> list[dict]:
    """
    Wrapper to retrieve narrative chunks given a user question.

    Args:
        question (str): The natural language question.
        top_k (int): Number of top chunks to return.

    Returns:
        List of dicts: Each chunk has 'chunk_id', 'doc_title', 'text'
    """
    try:
        return retrieve_narrative_chunks(
            question=question,
            index_path=FAISS_INDEX_PATH,
            metadata_path=METADATA_JSON_PATH,
            top_k=top_k
        )
    except Exception as e:
        print(f"❌ Error retrieving narrative chunks: {e}")
        return []


In [21]:
get_relevant_narrative_chunks("Which was more disappointing, the 2020 or 2023 Eagles season?", top_k=5)



🔍 Retrieving top 5 chunks for question: Which was more disappointing, the 2020 or 2023 Eagles season?

📄 Retrieved Chunks (boosted by years: 2023, 2020):

[1] (2020_philadelphia_eagles_season_001) from 2020_philadelphia_eagles_season
The 2020 season was the Philadelphia Eagles' 88th in the National Football League (NFL) and their fifth and final under head coach Doug Pederson. They failed to improve on their 9–7 record from the previous season following a 23–17 loss to the Seattle Seahawks in Week 12. They were eliminated from p
---

[2] (2023_philadelphia_eagles_season_001) from 2023_philadelphia_eagles_season
The 2023 season was the Philadelphia Eagles' 91st season in the National Football League (NFL), their 30th under the ownership of Jeffrey Lurie and their third under head coach Nick Sirianni. The Eagles entered the season as defending NFC champions. This season would mark the first season since 2010
---

[3] (2020_philadelphia_eagles_season_002) from 2020_philadelphia

[{'chunk_id': '2020_philadelphia_eagles_season_001',
  'doc_title': '2020_philadelphia_eagles_season',
  'text': "The 2020 season was the Philadelphia Eagles' 88th in the National Football League (NFL) and their fifth and final under head coach Doug Pederson. They failed to improve on their 9–7 record from the previous season following a 23–17 loss to the Seattle Seahawks in Week 12. They were eliminated from playoff contention for the first time since 2016 following a Week 16 loss to the Dallas Cowboys and finished with a dismal 4–11–1 record, the second-worst in the National Football Conference (NFC), and their worst since 2012. After starting 3–4–1 heading into their bye week and leading the NFC East, the Eagles would lose 7 of their last 8 games. Injuries and poor quarterback play were factors in their struggles in the season. On January 11, 2021, the Eagles announced head coach Doug Pederson would not return after the season, as he was dismissed the same day. For the f