# Contextual Chunk Headers (CCH)

## Overview
CCH prepends **higher-level context** (document title, summary) to chunks before embedding.
This gives embeddings a more accurate representation of what the chunk is actually about.

### The Problem
Chunks often refer to their subject only through pronouns or implicit references. A chunk about
"climate change risks to operations" may never mention "Nike", so its embedding scores poorly
against a query like "Nike climate change impact".

### The Fix
```
Before:  "Given the broad scope of our operations, we are vulnerable to physical risks of climate change..."
After:   "Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K\n\nGiven the broad scope of our operations..."
```
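
The effect can be illustrated without calling any API. The sketch below uses a toy bag-of-words vector as a stand-in for a real embedding model. This is an assumption made purely for illustration, since the notebook itself uses OpenAI embeddings, and the helper names `bow` and `cosine` are hypothetical. It shows the direction of the effect, not realistic scores.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


toy_query = "Nike climate change impact"
toy_chunk = "We are vulnerable to physical risks of climate change"
# Hypothetical short title, prepended in the same "Document Title:" style
toy_chunk_titled = f"Document Title: NIKE ANNUAL REPORT\n\n{toy_chunk}"

score_plain = cosine(bow(toy_query), bow(toy_chunk))
score_titled = cosine(bow(toy_query), bow(toy_chunk_titled))

print(f"without header: {score_plain:.3f}")
print(f"with header:    {score_titled:.3f}")
```

With the title prepended, the token "nike" overlaps with the query, so the titled chunk scores higher even though the chunk body is unchanged.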

## Pipeline
```
Document → LLM generates title → Chunk text → Prepend title to each chunk → Embed → FAISS
```

In [1]:
import os
import sys
from dotenv import load_dotenv
from typing import List, Dict, Any, Optional, Tuple
import json

load_dotenv()

True

In [2]:
os.chdir(r"C:\Users\TempAccess\Documents\Dhruv\RAG")
print(os.getcwd())

C:\Users\TempAccess\Documents\Dhruv\RAG


In [3]:
from helper_function_openai import (
    Document,
    RetrievalResult,
    OpenAIEmbedder,
    FAISSVectorStore,
    OpenAIChat,
    chunk_text,
    cosine_similarity,
)

print("Helpers imported")

Helpers imported


# Load Document and split into chunks

In [4]:
FILE_PATH = r"data\nike_2023_annual_report.txt"
FILE_PATH

'data\\nike_2023_annual_report.txt'

In [6]:
with open(FILE_PATH, "r", encoding="utf-8") as f:
    document_text = f.read()

print(document_text)

FORM 10-K FORM 10-KUNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C.
Washington, D.C. 20549
FORM 10-K 
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE FISCAL YEAR ENDED MAY 31, 2023 
OR
☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FOR THE TRANSITION PERIOD FROM TO .
Commission File No. 1-10635 
NIKE, Inc. 
(Exact name of Registrant as specified in its charter)
Oregon 93-0584541
(State or other jurisdiction of incorporation) (IRS Employer Identification No.)
One Bowerman Drive, Beaverton, Oregon 97005-6453 
(Address of principal executive offices and zip code)
(503) 671-6453 
(Registrant's telephone number, including area code)
SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:
Class B Common Stock NKE New York Stock Exchange
(Title of each class) (Trading symbol) (Name of each exchange on which registered)
SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE

In [7]:
print(f"Document length: {len(document_text)} chars")

Document length: 374938 chars


In [9]:
# Create simple chunks (no overlap, same as original)

chunks = chunk_text(document_text, chunk_size=1000, chunk_overlap=0)

print(f"Created {len(chunks)} chunks")

Created 398 chunks
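
`chunk_text` comes from the helper module, which is not shown. As a rough mental model, a fixed-size character splitter with optional overlap could look like the sketch below. `chunk_text_sketch` is a hypothetical name, and the real helper probably behaves differently: a plain 1000-character split of 374,938 characters would give 375 chunks rather than the 398 reported above, so it likely breaks on separators.

```python
from typing import List


def chunk_text_sketch(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Split text into fixed-size character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


demo_chunks = chunk_text_sketch("a" * 2500, chunk_size=1000, chunk_overlap=0)
print([len(c) for c in demo_chunks])
```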


## Generate the Document Title

In [10]:
llm = OpenAIChat(
    model_name="gpt-4o-mini",
    temperature=0.2,
    max_tokens=200
)

In [11]:
embedder = OpenAIEmbedder(model="text-embedding-3-small")

In [12]:
def get_document_title(document_text:str, guidance:str="")->str:
    """
    Use LLM to generate a descriptive document title.
    
    Replaces: get_document_title() with manual tiktoken truncation + OpenAI client
    """
    truncated = " ".join(document_text.split()[:3000])
    truncation_note = (
        "Note: the text below is just the first ~3000 words of the document. "
        "Your response should still pertain to the entire document."
        if len(document_text.split()) > 3000 else ""
    )

    messages = [
        {
            "role":"user",
            "content": (
                "What is the title of the following document?\n\n"
                "Your response MUST be the title of the document, and nothing else. "
                "DO NOT respond with anything else.\n\n"
                f"{guidance}\n\n"
                f"{truncation_note}\n\n"
                f"DOCUMENT\n{truncated}"
            )
        }
    ]

    return llm.chat(messages)
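
The title prompt relies on word-based truncation, and the same pattern is useful anywhere a long document is sent to the LLM. A small helper, sketched below under the hypothetical name `truncate_words`, returns both the truncated text and a flag so the caller can decide whether to add the truncation note.

```python
from typing import Tuple


def truncate_words(text: str, max_words: int = 3000) -> Tuple[str, bool]:
    """Return (text limited to max_words words, whether anything was cut).

    Like the notebook's inline version, this collapses runs of whitespace.
    """
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words), False
    return " ".join(words[:max_words]), True


preview, was_truncated = truncate_words("one two three four five", max_words=3)
print(preview, was_truncated)
```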

In [13]:
document_title = get_document_title(document_text)
print(f"Document Title: {document_title}")

Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K


## Now let's create chunks with and without the header

In [14]:
def create_chunk_with_header(document_title:str, document_summary:str="")->str:
    """
    Build the contextual header to prepend to each chunk.
    
    Can include:
    - Document title (most important)
    - Document summary (optional)
    - Section titles (optional, if you have them)
    """
    header = f"Document Title: {document_title}\n\n"
    if document_summary:
        header += f"Document Summary: {document_summary}\n\n"
    return header


In [15]:
def apply_chunk_headers(chunks:list[str], header:str)->list[str]:
    """
    Prepend the same header to every chunk.
    """
    return [f"<header>{header}</header>\n\n{chunk}" for chunk in chunks]

In [16]:
chunk_header = create_chunk_with_header(document_title)
print(f"Chunk Header: {chunk_header}")

Chunk Header: Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K




In [17]:
chunk_with_header = apply_chunk_headers(chunks, chunk_header)
print(f"\n Chunks Without headers: {len(chunks)}")
print(f"\n Chunks With headers: {len(chunk_with_header)}")


 Chunks Without headers: 398

 Chunks With headers: 398


# Measure Impact — CCH vs No-CCH

In [21]:
def measure_similarity(query:str, texts:List[str])->List[float]:
    """
    Measure cosine similarity between query and each text.
    """
    
    query_emb = embedder.embed_text(query)
    text_embs = embedder.embed_texts(texts)
    
    return [cosine_similarity(query_emb, te) for te in text_embs]
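
`cosine_similarity` is imported from the helper module and not shown. Assuming it follows the standard definition (dot product divided by the product of the vector norms), a dependency-free version could look like this; `cosine_similarity_sketch` is a hypothetical name.

```python
import math
from typing import Sequence


def cosine_similarity_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length dense vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends only on direction, prepending a header changes it only to the extent that the header shifts the direction of the chunk's embedding toward or away from the query.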

In [22]:
def compare_chunk_similarity(
    chunk_index:int,
    chunks:List[str],
    document_title:str,
    query:str
):

    """
    Compare similarity scores for a chunk with and without contextual header.
    """ 
    chunk_text_str = chunks[chunk_index]
    chunk_wo_header = chunk_text_str
    chunk_w_header = f"Document Title: {document_title}\n\n{chunk_text_str}"

    scores = measure_similarity(query, [chunk_wo_header, chunk_w_header])

    print(f"\nChunk header:")
    print(f"  Document Title: {document_title}")
    print(f"\nChunk text (index {chunk_index}):")
    print(f"  {chunk_text_str[:300]}...")
    print(f"\nQuery: {query}")
    print(f"\nSimilarity WITHOUT contextual chunk header: {scores[0]:.6f}")
    print(f"Similarity WITH contextual chunk header:    {scores[1]:.6f}")
    
    improvement = scores[1] - scores[0]
    # The "+" format flag prints an explicit sign, so negative changes no longer render as "+-"
    print(f"\n→ Improvement: {improvement:+.6f} ({improvement/max(scores[0], 0.001)*100:.1f}%)")

In [23]:
CHUNK_INDEX = 86

QUERY = "Nike climate change impact"

compare_chunk_similarity(CHUNK_INDEX, chunks, document_title, QUERY)


Chunk header:
  Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K

Chunk text (index 86):
  demand for our products could decline, and if we experience problems with the quality of our products, we may incur substantial 
expense to remedy the problems and loss of consumer confidence.
Failure to continue to obtain or maintain high-quality endorsers of our products could harm our business.
W...

Query: Nike climate change impact

Similarity WITHOUT contextual chunk header: 0.299195
Similarity WITH contextual chunk header:    0.389380

→ Improvement: +0.090185 (30.1%)


In [25]:
# Try more queries
test_queries = [
    "Nike revenue growth",
    "supply chain disruptions",
    "executive compensation",
    "brand marketing strategy"
]

# Pick a few different chunks to test
test_chunk_indices = [10, 30, 50, 86]

for ci in test_chunk_indices:
    if ci < len(chunks):
        for q in test_queries[:2]:  # test 2 queries per chunk
            compare_chunk_similarity(ci, chunks, document_title, q)
            print(f"\n{'─'*60}")


Chunk header:
  Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K

Chunk text (index 10):
  Our wholly-owned subsidiary brand, Converse, headquartered in Boston, Massachusetts, designs, distributes and licenses 
casual sneakers, apparel and accessories under the Converse, Chuck Taylor, All Star, One Star, Star Chevron and Jack Purcell 
trademarks. Operating results of the Converse brand ar...

Query: Nike revenue growth

Similarity WITHOUT contextual chunk header: 0.475839
Similarity WITH contextual chunk header:    0.467415

→ Improvement: +-0.008424 (-1.8%)

────────────────────────────────────────────────────────────

Chunk header:
  Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K

Chunk text (index 10):
  Our wholly-owned subsidiary brand, Converse, headquartered in Boston, Massachusetts, designs, distributes and licenses 
casual sneakers, apparel and accessories under the Converse, Chuck Taylor, All Star, One Star, Star Chevron and Jack Purcell 
trademarks. Operating re
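
A single chunk/query pair can move in either direction, as the -1.8% case above shows, so it helps to aggregate. The sketch below (hypothetical `summarize_header_effect`) takes `(without, with)` score pairs and reports the mean change, the share of pairs that improved, and the worst change. The two sample pairs are the scores recorded above for chunk 86 and chunk 10.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize_header_effect(pairs: List[Tuple[float, float]]) -> Dict[str, float]:
    """Aggregate (score_without_header, score_with_header) pairs."""
    if not pairs:
        raise ValueError("need at least one score pair")
    deltas = [with_header - without for without, with_header in pairs]
    return {
        "mean_delta": mean(deltas),
        "share_improved": sum(d > 0 for d in deltas) / len(deltas),
        "worst_delta": min(deltas),
    }


# Scores copied from the recorded outputs in this notebook
recorded_pairs = [(0.299195, 0.389380), (0.475839, 0.467415)]
summary = summarize_header_effect(recorded_pairs)
print(summary)
```

Two pairs are far too few to draw conclusions; the same function can be fed every chunk/query combination from the loop above.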

# Build Full RAG Index: With and Without CCH

In [26]:
def build_index(texts: List[str], source: str = "") -> FAISSVectorStore:
    """Build a FAISS index from a list of text strings."""
    documents = [
        Document(content=t, metadata={"source": source, "chunk_id": i})
        for i, t in enumerate(texts)
    ]
    documents = embedder.embed_documents(documents)
    store = FAISSVectorStore(dimension=embedder.dimension)
    store.add_documents(documents)
    return store
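
`FAISSVectorStore` is defined in the helper module and not shown. Conceptually, searching a flat index amounts to scoring the query against every stored vector and keeping the top `k`. The pure-Python stand-in below (hypothetical `TinyVectorStore`, not the helper's API) sketches that idea with cosine scoring. Whether the real store uses cosine, inner product, or L2 distance is an assumption to verify against the helper.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _normalize(vector: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class TinyVectorStore:
    """Brute-force cosine search over unit-normalized vectors (illustrative)."""

    def __init__(self) -> None:
        self._vectors: List[List[float]] = []
        self._metadata: List[Dict] = []

    def add(self, vector: Sequence[float], metadata: Dict) -> None:
        self._vectors.append(_normalize(vector))
        self._metadata.append(metadata)

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[float, Dict]]:
        q = _normalize(query)
        # On unit vectors, the dot product equals cosine similarity
        scored = (
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self._vectors)
        )
        return [(score, self._metadata[i]) for score, i in heapq.nlargest(k, scored)]


tiny = TinyVectorStore()
tiny.add([1.0, 0.0], {"chunk_id": 0})
tiny.add([0.0, 1.0], {"chunk_id": 1})
tiny.add([1.0, 1.0], {"chunk_id": 2})
top_hits = tiny.search([1.0, 0.1], k=2)
print([meta["chunk_id"] for _, meta in top_hits])
```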

In [27]:
print("Building index WITHOUT CCH...")
store_no_cch = build_index(chunks, source=FILE_PATH)
print(f"  Indexed {store_no_cch.index.ntotal} chunks")


Building index WITHOUT CCH...
  Indexed 398 chunks


In [29]:
print("\nBuilding index WITH CCH...")
store_with_cch = build_index(chunk_with_header, source=FILE_PATH)
print(f"  Indexed {store_with_cch.index.ntotal} chunks")


Building index WITH CCH...
  Indexed 398 chunks


In [32]:
def compare_retrieval(query:str, k:int=3):
    """Compare top-k retrieval results from the index without CCH and the index with CCH."""

    query_embed = embedder.embed_text(query)

    result_no_cch = store_no_cch.search(query_embed, k=k)
    result_with_cch = store_with_cch.search(query_embed, k=k)

    print(f"\nQuery: {query}")
    print(f"{'='*80}")
    
    print(f"\n--- WITHOUT CCH ---")
    for r in result_no_cch:
        print(f"  [{r.score:.4f}] Chunk #{r.document.metadata['chunk_id']}: {r.document.content[:120]}...")
    
    print(f"\n--- WITH CCH ---")
    for r in result_with_cch:
        # Strip header for display
        content = r.document.content
        if "\n\n" in content:
            content = content.split("\n\n", 1)[1]  # show chunk text without header
        print(f"  [{r.score:.4f}] Chunk #{r.document.metadata['chunk_id']}: {content[:120]}...")
    
    # Check overlap
    no_cch_ids = {r.document.metadata['chunk_id'] for r in result_no_cch}
    cch_ids = {r.document.metadata['chunk_id'] for r in result_with_cch}
    print(f"\n  No-CCH chunks: {sorted(no_cch_ids)}")
    print(f"  CCH chunks:    {sorted(cch_ids)}")
    print(f"  Overlap:       {len(no_cch_ids & cch_ids)}/{k}")
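
The `WITH CCH` results printed by this function start with a stray `</header>`. The header built earlier ends with a blank line, so the `<header>...</header>` wrapper itself contains `\n\n`, and splitting on the first `\n\n` cuts inside the tag. A tag-aware helper avoids this. `strip_header` below is a hypothetical name and assumes the exact `<header>{header}</header>\n\n{chunk}` layout produced by `apply_chunk_headers`.

```python
HEADER_OPEN = "<header>"
HEADER_CLOSE = "</header>"


def strip_header(content: str) -> str:
    """Return the chunk text without its <header>...</header> prefix, if present."""
    if not content.startswith(HEADER_OPEN) or HEADER_CLOSE not in content:
        return content  # no header wrapper: leave the text untouched
    rest = content.split(HEADER_CLOSE, 1)[1]
    # Drop the blank-line separator that follows the closing tag
    return rest[2:] if rest.startswith("\n\n") else rest


demo_header = "Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K\n\n"
demo_chunk = "Various countries and regions are following different approaches..."
wrapped = f"{HEADER_OPEN}{demo_header}{HEADER_CLOSE}\n\n{demo_chunk}"

print(repr(wrapped.split("\n\n", 1)[1][:20]))  # naive split keeps the closing tag
print(repr(strip_header(wrapped)[:20]))        # tag-aware strip returns the chunk text
```

The same helper also covers the later three-way comparison, where the `startswith("Document")` check never matches because headered chunks begin with `<header>`.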

In [33]:
compare_retrieval("Nike climate change impact")


Query: Nike climate change impact

--- WITHOUT CCH ---
  [0.4651] Chunk #75: •Decreased retail traffic as a result of store closures, reduced operating hours, social distancing restrictions and/or ...
  [0.4597] Chunk #160: impacted gross margin for fiscal 2023. The strategic pricing actions we have taken partially offset the impacts of these...
  [0.4523] Chunk #190: NIKE, INC.      38 GREATER CHINA
(Dollars in millions) FISCAL 2023 FISCAL 2022 % CHANGE% CHANGE 
EXCLUDING 
CURRENCY 
CH...

--- WITH CCH ---
  [0.5187] Chunk #66: </header>

Various countries and regions are following different approaches to the regulation of climate change, which c...
  [0.5116] Chunk #68: </header>

to greenhouse gas emissions, carbon costs or climate-related goals; adapting products to customer preferences...
  [0.5095] Chunk #64: </header>

in part on high quality merchandising and an appealing retail environment to attract consumers, which require...

  No-CCH chunks: [75, 160, 190]
  CCH chunks:  

In [34]:
compare_retrieval("Nike revenue and financial performance")


Query: Nike revenue and financial performance

--- WITHOUT CCH ---
  [0.6552] Chunk #185: •North America revenues increased 18% on a currency-neutral basis, primarily due to higher revenues in Men's and the 
Jo...
  [0.6486] Chunk #173: Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. 
...
  [0.6457] Chunk #174: •NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men's. 
Unit ...

--- WITH CCH ---
  [0.6373] Chunk #184: </header>

(1) Total NIKE Brand EBIT, Total NIKE, Inc. EBIT and EBIT Margin represent non-GAAP financial measures. See "...
  [0.6362] Chunk #173: </header>

Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to N...
  [0.6355] Chunk #174: </header>

•NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Me...

  No-CCH chunks: [173, 174, 185

In [35]:
compare_retrieval("supply chain risks and disruptions")


Query: supply chain risks and disruptions

--- WITHOUT CCH ---
  [0.5826] Chunk #73: lead to adverse impacts to our global supply chain, factory cancellation costs, store closures, and a decline in retail ...
  [0.5657] Chunk #117: suppliers and manufacturers in our methods, products, quality control standards and labor , health and safety standards....
  [0.5629] Chunk #56: geopolitical conflicts have impacted and may continue to impact the availability , pricing and timing for obtaining comm...

--- WITH CCH ---
  [0.4686] Chunk #73: </header>

lead to adverse impacts to our global supply chain, factory cancellation costs, store closures, and a decline...
  [0.4531] Chunk #74: </header>

of facility closures, increased operating costs, reductions in operating hours, labor shortages, and real tim...
  [0.4506] Chunk #72: </header>

key manufacturing or distribution locations or quickly repair damage to our Information Technology Systems or...

  No-CCH chunks: [56, 73, 117]
  CCH chu

In [36]:
def get_document_summary(document_text: str) -> str:
    """Generate a concise summary of the document for use in chunk headers."""
    truncated = " ".join(document_text.split()[:3000])
    
    messages = [
        {
            "role": "system",
            "content": "You generate concise document summaries for use as context in RAG systems."
        },
        {
            "role": "user",
            "content": (
                "Write a 2-3 sentence summary of this document. "
                "Focus on what the document is about and who it pertains to.\n\n"
                f"Document:\n{truncated}"
            )
        }
    ]
    
    return llm.chat(messages)

In [37]:
doc_summary = get_document_summary(document_text)
print(f"Document Summary: {doc_summary}")

Document Summary: The document is the Form 10-K annual report for NIKE, Inc. for the fiscal year ending May 31, 2023, detailing the company's business operations, financial performance, and market strategies. It pertains to NIKE, Inc., a leading global seller of athletic footwear and apparel, and includes information on product offerings, sales channels, manufacturing processes, and risk factors affecting the company's operations.


In [39]:
# Build enhanced header with title + summary
enhanced_header = create_chunk_with_header(document_title, doc_summary)
print(f"Enhanced header:\n{enhanced_header}")

# Apply and build index
chunks_with_enhanced = apply_chunk_headers(chunks, enhanced_header)

print(f"\nBuilding index with title + summary header...")
store_enhanced = build_index(chunks_with_enhanced, source=FILE_PATH)
print(f"Indexed {store_enhanced.index.ntotal} chunks")

Enhanced header:
Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K

Document Summary: The document is the Form 10-K annual report for NIKE, Inc. for the fiscal year ending May 31, 2023, detailing the company's business operations, financial performance, and market strategies. It pertains to NIKE, Inc., a leading global seller of athletic footwear and apparel, and includes information on product offerings, sales channels, manufacturing processes, and risk factors affecting the company's operations.



Building index with title + summary header...
Indexed 398 chunks


In [41]:
# Three-way comparison

query = "Nike Climate change impact"

query_embed = embedder.embed_text(query)

print(query_embed)

[0.016595248132944107, 0.011081387288868427, 0.00972639862447977, 0.01648792251944542, -0.010839903727173805, -0.030963487923145294, -0.02282014489173889, -0.013006542809307575, 0.003434424987062812, 0.01690381020307541, 0.041454583406448364, 0.04306447133421898, -0.06702495366334915, 0.006120923440903425, -0.07990404218435287, 0.024362951517105103, -0.024094637483358383, -0.009934342466294765, -0.00688226567581296, 0.0033488997723907232, -0.010772825218737125, 0.041883885860443115, -0.01984185352921486, 0.0060135978274047375, 0.005996827967464924, 0.021760301664471626, 0.0007642769487574697, -0.021424908190965652, 0.05806324630975723, 0.061175696551799774, 0.01777583174407482, -0.03388811647891998, -0.015186597593128681, 0.007613422814756632, 0.044406041502952576, -0.023705581203103065, -0.021746886894106865, 0.02658996172249317, 0.0058861481957137585, 0.004346694331616163, 0.02141149342060089, -0.00971969123929739, 0.022887222468852997, 0.013610250316560268, -0.023544592782855034, 0.

In [44]:
print(f"Query: {query}\n")

for label, store in [("No CCH", store_no_cch), ("Title only", store_with_cch), ("Title + Summary", store_enhanced)]:
    results = store.search(query_embed, k=2)
    print(f"--- {label} ---")
    for r in results:
        content = r.document.content
        # Strip header for display
        if "\n\n" in content and content.startswith("Document"):
            content = content.split("\n\n", 1)[1]
        print(f"  [{r.score:.4f}] {content[:300]}...")
    print()

Query: Nike Climate change impact

--- No CCH ---
  [0.4808] •Decreased retail traffic as a result of store closures, reduced operating hours, social distancing restrictions and/or changes in 
consumer behavior;
•Reduced consumer demand for our products, including as a result of a rise in unemployment rates, higher costs of 
borrowing, inflation and diminishe...
  [0.4647] impacted gross margin for fiscal 2023. The strategic pricing actions we have taken partially offset the impacts of these higher 
costs.
• Supply Chain Volatility:  Supply chain challenges, macroeconomic conditions and the impact of the COVID-19 pandemic 
on the manufacturing of our product disrupted...

--- Title only ---
  [0.5325] <header>Document Title: NIKE, INC. ANNUAL REPORT ON FORM 10-K

</header>

Various countries and regions are following different approaches to the regulation of climate change, which could increase the 
complexity of, and potential cost related to complying with, such regulations. Any of t