 # Implementation



 This notebook walks through the steps taken to implement the entire ensemble RAG pipeline. The baseline models are implemented in `evaluation/scenarios.py`.


 ## Data Preparation



 ### Data loading



 First we load the data. We'll use the `document_store.py` file for this.

In [1]:


import sys
from pathlib import Path

project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))


from sec_insights.rag.document_store import DocumentStore

# Initialize the DocumentStore with default tickers
print("🔄 Initializing DocumentStore...")
raw_data_path = project_root / "data" / "raw" / "df_filings_full.parquet"
doc_store = DocumentStore(raw_data_path=raw_data_path)

# You can also specify custom tickers of interest:
# doc_store = DocumentStore(tickers_of_interest=['AAPL', 'META', 'GOOGL'])

# Load the full dataset
print("📁 Loading the full SEC filings dataset...")
full_dataset = doc_store.get_all_sentences()
full_dataset.head()


W0621 16:33:21.103000 64498 .venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


🔄 Initializing DocumentStore...
📁 Loading the full SEC filings dataset...
📁 DocumentStore: Loading and processing raw sentence data...



✅ Loaded 64566 sentences for 5 tickers.
⚙️  Preprocessing sentences and counting tokens...
Pre-calculating full texts for each document...
✅ Pre-calculation of full texts complete.


| | ticker | fiscal_year | docID | sentenceID | sentence | section | sentence_token_count |
| --: | :-- | --: | :-- | :-- | :-- | :-- | --: |
| 55255 | AAPL | 2012 | 0000320193_10-K_2012 | 0000320193_10-K_2012_section_1_0 | Item 1. Business Company Background The Compan... | 1 | 52 |
| 55256 | AAPL | 2012 | 0000320193_10-K_2012 | 0000320193_10-K_2012_section_1_1 | The Company’s products and services include iP... | 1 | 49 |
| 55257 | AAPL | 2012 | 0000320193_10-K_2012 | 0000320193_10-K_2012_section_1_2 | The Company also sells and delivers digital co... | 1 | 29 |
| 55258 | AAPL | 2012 | 0000320193_10-K_2012 | 0000320193_10-K_2012_section_1_3 | The Company sells its products worldwide throu... | 1 | 39 |
| 55259 | AAPL | 2012 | 0000320193_10-K_2012 | 0000320193_10-K_2012_section_1_4 | In addition, the Company sells a variety of th... | 1 | 36 |


 ### Chunking



 We previously determined that the optimal chunking strategy is as follows:



 - 150 average tokens per chunk

 - 50 token overlap

 - 500 maximum token limit



 So we'll chunk the full dataset according to that.

In [2]:
from sec_insights.rag.chunkers import SmartChunker

# Use the 150/50/500 configuration described above (it also matches the cached embeddings file)
chunker = SmartChunker(target_tokens=150, overlap_tokens=50, hard_ceiling=500)
chunks = chunker.run(full_dataset)
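
To make the three parameters concrete, here is a minimal sketch of greedy sentence packing with overlap. It is not the `SmartChunker` implementation: the function names are hypothetical, and tokens are approximated by whitespace splitting rather than a real tokenizer.

```python
def count_tokens(text):
    # Assumption: whitespace tokens stand in for real tokenizer counts.
    return len(text.split())


def chunk_sentences(sentences, target_tokens=150, overlap_tokens=50, hard_ceiling=500):
    """Greedily pack sentences into chunks of roughly `target_tokens` tokens."""
    chunks = []
    current = []      # sentences in the chunk being built
    current_len = 0   # token count of `current`
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the chunk once the target is reached, or if adding the
        # next sentence would push it past the hard ceiling.
        if current and (current_len >= target_tokens or current_len + n > hard_ceiling):
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, up to `overlap_tokens`.
            carried, carried_len = [], 0
            for prev in reversed(current):
                prev_len = count_tokens(prev)
                if carried_len + prev_len > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Ten synthetic 40-token sentences, each starting with a marker s0..s9.
sentences = [" ".join([f"s{i}"] + ["w"] * 39) for i in range(10)]
chunks = chunk_sentences(sentences)
```

In this sketch a single sentence longer than the ceiling is kept whole; a production chunker would need to split it.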



 ### Retrieving embeddings



 We'll use OpenAI to get the embeddings for each chunk.

In [3]:
from sec_insights.rag.embedding import EmbeddingManager
import json
import os  # needed below for os.getcwd()
import pickle

# Custom unpickler to handle module path changes
class ModulePathUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Map old 'rag' module paths to new 'sec_insights.rag' paths
        if module.startswith('rag.'):
            module = 'sec_insights.' + module
        elif module == 'rag':
            module = 'sec_insights.rag'
        return super().find_class(module, name)

def load_chunks_with_path_fix(file_path):
    """Load pickled chunks with module path mapping."""
    with open(file_path, 'rb') as f:
        unpickler = ModulePathUnpickler(f)
        return unpickler.load()

# Now use this function instead of pickle.load
embeddings_dir = Path("../data/cache/embeddings")
if (embeddings_dir / "target_150_overlap_50_ceiling_500.pkl").exists():
    chunks = load_chunks_with_path_fix(embeddings_dir / "target_150_overlap_50_ceiling_500.pkl")
    print(f"✅ Loaded {len(chunks)} chunks from {embeddings_dir / 'target_150_overlap_50_ceiling_500.pkl'}")
else:
    print("No embeddings found, generating...")
    embedding_manager = EmbeddingManager()

    texts = [chunk.text for chunk in chunks]

    embeddings = embedding_manager.embed_texts_in_batches(texts)

    # Add embeddings to chunks
    for i, chunk in enumerate(chunks):
        chunk.embedding = embeddings[i]

    # save chunks to json

    save_dir = Path(os.getcwd()).parent / "data" / "implementation_example_files"
    save_dir.mkdir(parents=True, exist_ok=True)

    # Save chunks to JSON file
    with open(save_dir / "chunks_small_w_embeddings.json", "w") as f:
        json.dump(chunks, f, indent=2)

    print(f"✅ Saved {len(chunks)} chunks to chunks_small_w_embeddings.json")



✅ Loaded 42480 chunks from ../data/cache/embeddings/target_150_overlap_50_ceiling_500.pkl
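
The batching idea behind a call like `embed_texts_in_batches` can be sketched as follows. This is an illustration only, not the `EmbeddingManager` code: `embed_in_batches`, `fake_embed`, and the batch size are hypothetical, and the stand-in embedding function avoids any API call.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call `embed_fn` on fixed-size slices of `texts`, preserving order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input text")
        embeddings.extend(vectors)
    return embeddings


def fake_embed(batch):
    # Hypothetical stand-in for a real embedding API: one 1-d vector per text.
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

Keeping the embedding call injectable makes the batching logic testable without network access.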


 ## Generate labeled data



 We need ground truth against which to compare our RAG predictions in order to evaluate recall and precision. I will use `LangChain`'s OpenAI wrapper to create QA pairs from chunks. There is some skepticism in the NLP community about the validity of LLM-generated training or evaluation data, but given resource and time limitations I'll assume that the LLM-generated questions are valid. Considering the short context of the chunks given to the LLM, and the types of questions we're aiming for ("How much operating revenue did Tesla make in 2015?"), the risk that the resulting metrics are entirely unreliable is low.



 In a real-world scenario, I would prefer a professionally labeled dataset with questions similar to what analysts and consultants may ask, along with validated answers. I would also want daily quality checks of some sort (perhaps a rolling z-score deviation of the cosine similarity of certain clusters of documents) and an automated evaluation/tuning loop, but that's outside the scope of this project.



 The following prompt is used:

 ```

 You are a financial analyst assistant. Your job is to generate high-quality question-answer pairs based on SEC filing text.

 INSTRUCTIONS:

 1. Generate 2 specific, answerable questions based ONLY on the provided text.

 2. Each question must explicitly include the company name and fiscal year.

 3. Provide accurate, concise answers based solely on the text content.

 4. Return your response as valid JSON in this exact format: {"qa_pairs": [{"question": "...", "answer": "..."}, ...]}

 ```

 Our first step, however, is to stratify the sampled chunks so that no company, year, or section is overrepresented in the evaluation set.

In [4]:

import random
from pathlib import Path

from sec_insights.evaluation.generate_qa_dataset import (
    BalancedChunkSampler,
    generate_qa_pairs,
    prepare_chunks_for_qa_generation,
)

sampler = BalancedChunkSampler(max_per_group=5)
grouped_chunks = sampler.group_chunks_by_keys(chunks)
balanced_chunks = random.sample(
    sampler.stratified_sample(grouped_chunks), 300
)  

print(f"✅ Selected {len(balanced_chunks)} balanced chunks")


🎯 Balancing to 384 chunks per company.
   - AAPL: 384 chunks
   - AMZN: 384 chunks
   - META: 384 chunks
   - NVDA: 384 chunks
   - TSLA: 384 chunks
✅ Selected 300 balanced chunks
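
As a rough illustration of the stratification idea, the sketch below groups chunks by a metadata key and caps how many are drawn from each group. It is not the `BalancedChunkSampler` implementation: chunks are modeled as plain dicts, and the key names and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(chunks, keys=("ticker", "fiscal_year", "section"),
                      max_per_group=5, seed=0):
    """Draw at most `max_per_group` chunks from every metadata group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for chunk in chunks:
        groups[tuple(chunk[k] for k in keys)].append(chunk)
    sampled = []
    for members in groups.values():
        sampled.extend(rng.sample(members, min(max_per_group, len(members))))
    return sampled


# Hypothetical input: one large group and one small group.
demo_chunks = (
    [{"ticker": "AAPL", "fiscal_year": 2012, "section": "1", "seq": i} for i in range(20)]
    + [{"ticker": "TSLA", "fiscal_year": 2014, "section": "7A", "seq": i} for i in range(3)]
)
sample = stratified_sample(demo_chunks)
```

Capping each group keeps a heavily represented company or section from dominating the evaluation set.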


 Now we generate all the QA pairs.

In [7]:
import os
from pprint import pprint
qa_output_path = (
    Path(os.getcwd()).parent / "data" / "processed" / "qa_dataset_300.jsonl"
)
if qa_output_path.exists():
    print(f"🎉 QA pairs already generated and saved to {qa_output_path}")
    # Use a context manager so the file handle is closed after reading
    with open(qa_output_path, "r") as f:
        prepared_chunks = [json.loads(line) for line in f]
else:
    print(f"🔄 Generating QA pairs...")
    prepared_chunks = prepare_chunks_for_qa_generation(balanced_chunks)
    generate_qa_pairs(prepared_chunks, qa_output_path, debug_mode=False)
    print(f"🎉 Generated ~{len(balanced_chunks)} questions saved to {qa_output_path}")

pprint(prepared_chunks[:2])


🎉 QA pairs already generated and saved to /Users/jon/GitHub/dowjones-takehome/data/processed/qa_dataset_300.jsonl
[{'answer': '$2.0 million',
  'chunk_id': '333ed609-2dc7-5a4f-b1cd-ea2041679ee1',
  'human_readable_id': 'TSLA_2014_7A_3',
  'question': 'What was the amount of gains recorded by the company due to '
              'foreign currency exchange transactions for the fiscal year '
              'ended December 31, 2014?',
  'section': '7A',
  'section_letter': 'A',
  'section_num': '7',
  'source_text': 'As a result of a favorable foreign currency exchange impact '
                 'from foreign currency-denominated liabilities, especially '
                 'related to the Japanese yen, we recorded gains of $2.0 '
                 'million on foreign exchange transactions in other income '
                 '(expense), net, for the year ended December 31, 2014. '
                 'Interest Rate Risk We had cash and cash equivalents totaling '
                 '$1.91 billion as of

 Notice that some of the questions don't explicitly mention the company name, even though the prompt asks for it. I tried many prompts to get the model to include the company name consistently, but to no avail. This could be a target for fine-tuning at a later stage.



 My short-term solution is to inject the information into the beginning of the question.
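
The injection described above might look like the following minimal sketch. The helper name and prefix wording are hypothetical, and it assumes `human_readable_id` follows the `TICKER_YEAR_SECTION_SEQ` pattern seen in the sample output (for example `TSLA_2014_7A_3`).

```python
def add_context_prefix(qa_item):
    """Prepend ticker and fiscal year (parsed from the chunk id) to the question."""
    # Assumption: ids look like "TSLA_2014_7A_3", i.e. ticker and year come first.
    ticker, fiscal_year = qa_item["human_readable_id"].split("_")[:2]
    return f"For {ticker} in fiscal year {fiscal_year}: {qa_item['question']}"


example = {
    "question": "What was the amount of gains recorded by the company?",
    "human_readable_id": "TSLA_2014_7A_3",
}
prefixed = add_context_prefix(example)
```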

 ## Parameter optimization

 First we optimize the number of tokens per chunk. I ran 50 questions against four chunking configurations and compared recall, MRR, and ROUGE.

In [11]:


import json
import os
import sys
from pathlib import Path
from pprint import pprint
import pandas as pd

sys.path.append(str(Path(os.getcwd()).parent))


data_path = Path(os.getcwd()).parent / "data"

configs = [
    {"target_tokens": 150, "overlap_tokens": 25, "name": "Small_150_25"},
    {"target_tokens": 300, "overlap_tokens": 50, "name": "Medium_300_50"},
    {"target_tokens": 500, "overlap_tokens": 100, "name": "Large_500_100"},
    {"target_tokens": 750, "overlap_tokens": 150, "name": "XLarge_750_150"},
]

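# The resulting CSV is long and poorly formatted, so the key metrics are summarized in markdown below.
# Note: the archived results loaded here use the settings shown in their `configuration` column,
# which differ slightly from the `configs` list above.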

# df_results = compare_chunking_configs(num_questions=50, configs=configs)
# df_results.to_csv(data_path / 'small_rerun_results.csv')
df_results = pd.read_csv(data_path / 'results' / 'archived_results'/'summaries'/'chunking_comparison_all_configs_20250620_184558.csv')
df_results


Unnamed: 0,configuration,target_tokens,overlap_tokens,hard_ceiling,total_chunks,timestamp,rag_recall_at_1,rag_recall_at_3,rag_recall_at_5,rag_recall_at_10,...,ensemble_rerank_rag_adj_recall_at_5,ensemble_rerank_rag_adj_recall_at_10,ensemble_rerank_rag_adj_mrr,ensemble_rerank_rag_rouge1_f,ensemble_rerank_rag_rouge2_f,ensemble_rerank_rag_rougeL_f,ensemble_rerank_rag_avg_prompt_tokens,ensemble_rerank_rag_avg_completion_tokens,ensemble_rerank_rag_avg_total_tokens,ensemble_rerank_rag_total_cost
0,XLarge_750_150_1000,750,150,1000,4924,20250620_182003,0.04,0.04,0.04,0.06,...,0.230769,0.230769,0.192308,0.142148,0.089563,0.122328,2912.230769,52.0,2964.230769,0.023477
1,Large_500_100_800,500,100,800,7438,20250620_165509,0.16,0.16,0.16,0.18,...,0.18,0.18,0.17,0.458171,0.339505,0.413018,1978.34,54.78,2033.12,0.016556
2,Medium_350_100_800,350,100,800,12654,20250620_160551,0.12,0.12,0.12,0.14,...,0.14,0.14,0.13,0.464168,0.348264,0.423862,1549.34,55.26,1604.6,0.013353
3,Small_150_50_500,150,50,500,42480,20250620_142804,0.26,0.42,0.44,0.5,...,0.510204,0.530612,0.391691,0.463828,0.356921,0.428215,934.0,53.591837,987.591837,0.008688


For simplicity, we'll look at Recall@5, ROUGE-L, and nDCG@10.

| Configuration          | Vanilla Recall\@5 | Reranked Recall\@5 | Ensemble Recall\@5 |
| :--------------------- | :---------------: | :----------------: | :----------------: |
| XLarge\_750\_150\_1000 |       0.040       |        0.060       |      **0.231**     |
| Large\_500\_100\_800   |       0.160       |      **0.180**     |      **0.180**     |
| Medium\_350\_100\_800  |       0.120       |      **0.140**     |      **0.140**     |
| Small\_150\_50\_500    |       0.440       |      **0.540**     |        0.490       |

| Configuration          | Vanilla ROUGE-L | Reranked ROUGE-L | Ensemble ROUGE-L |
| :--------------------- | :-------------: | :--------------: | :--------------: |
| XLarge\_750\_150\_1000 |      0.101      |     **0.124**    |       0.122      |
| Large\_500\_100\_800   |      0.323      |       0.355      |     **0.413**    |
| Medium\_350\_100\_800  |      0.334      |       0.349      |     **0.424**    |
| Small\_150\_50\_500    |      0.354      |       0.373      |     **0.428**    |

| Configuration          | Vanilla nDCG\@10 | Reranked nDCG\@10 | Ensemble nDCG\@10 |
| :--------------------- | :--------------: | :---------------: | :---------------: |
| XLarge\_750\_150\_1000 |       0.047      |       0.060       |     **0.202**     |
| Large\_500\_100\_800   |       0.167      |     **0.180**     |       0.173       |
| Medium\_350\_100\_800  |       0.127      |     **0.140**     |       0.133       |
| Small\_150\_50\_500    |       0.388      |     **0.450**     |       0.413       |
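
For reference, the retrieval metrics can be computed per question as in the sketch below and then averaged over the evaluation set. It assumes each question has exactly one relevant chunk (the chunk it was generated from); the function names are hypothetical, and this is not the project's evaluation code.

```python
import math


def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold chunk appears in the top k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the gold chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == gold_id:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gold_id, k):
    """nDCG with a single relevant item: the ideal DCG is 1, so nDCG = 1 / log2(rank + 1)."""
    for rank, chunk_id in enumerate(ranked_ids[:k], start=1):
        if chunk_id == gold_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0


ranked = ["a", "b", "c"]  # hypothetical retrieval order
scores = {
    "recall@1": recall_at_k(ranked, "b", 1),
    "recall@3": recall_at_k(ranked, "b", 3),
    "rr": reciprocal_rank(ranked, "b"),
    "ndcg@10": ndcg_at_k(ranked, "b", 10),
}
```

Averaging `reciprocal_rank` over all questions gives MRR; the same averaging applies to the other two metrics.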



 Takeaways:

 - The small configuration consistently outperforms the other configurations.

 - In the small configuration, the single reranker leads on Recall@5 and nDCG@10 but trails the ensemble on ROUGE-L, which suggests that our reranker isn't reranking properly.



 The key takeaway for now is to keep the 150/50/500 chunking configuration and move on to testing all models.

 ## Baseline scenarios

 ### Vanilla `gpt-4o-mini`



 This implementation is the simplest: we send the question to the API without any context and evaluate the answer.

In [12]:
# load qa set
with open(
    Path(os.getcwd()).parent / "data" / "processed" / "qa_dataset_300.jsonl", "r"
) as f:
    qa_set = [json.loads(line) for line in f]

from openai import OpenAI

from sec_insights.evaluation.scenarios import run_baseline_scenario

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

qa_item = random.choice(qa_set)

baseline_output = run_baseline_scenario(openai_client, qa_item)
print(f"Question: {qa_item['question']}")
print(f"Expected: {qa_item['answer']}")
pprint(baseline_output[0])


Question: What was the amount of gains recorded by the company due to foreign currency exchange transactions for the fiscal year ended December 31, 2014?
Expected: $2.0 million
("I don't have access to specific figures from Tesla's SEC filings for fiscal "
 'year 2014, including the exact amount of gains from foreign currency '
 'exchange transactions. Generally, companies report such gains or losses in '
 'their financial statements, typically within the notes to the financial '
 'statements or in the management discussion and analysis section')


 ### `gpt-4o-mini` with web search

In [13]:
from sec_insights.evaluation.scenarios import run_web_search_scenario

response, tokens_used = run_web_search_scenario(openai_client, qa_item)

print(f"Question: {qa_item['question']}")
print(f"Expected: {qa_item['answer']}")
pprint(f"Web Search answer: {response}")

Question: What was the amount of gains recorded by the company due to foreign currency exchange transactions for the fiscal year ended December 31, 2014?
Expected: $2.0 million
('Web Search answer: In its fiscal year ending December 31, 2014, Tesla '
 'recorded foreign currency transaction gains of $2.0 million, primarily due '
 'to favorable exchange rate impacts from foreign currency-denominated '
 'liabilities, especially related to the Japanese yen. '
 '([sec.gov](https://www.sec.gov/Archives/edgar/data/1318605/000156459015001031/tsla-10k_20141231.htm?utm_source=openai))')


 ### `gpt-4o-mini` with full context



 This is the most wasteful baseline, but an interesting one. It uploads an entire SEC 10-K filing as context and asks the model to parse the whole document for the answer.

In [14]:
# Full context GPT search - shortest possible
from sec_insights.evaluation.scenarios import run_unfiltered_context_scenario

# Load QA dataset and pick random question
with open(
    Path(os.getcwd()).parent / "data" / "processed" / "qa_dataset_300.jsonl", "r"
) as f:
    qa_set = [json.loads(line) for line in f]

qa_item = random.choice(qa_set)

# Run full context scenario (gets full filing text + asks question)
answer, token_usage = run_unfiltered_context_scenario(doc_store, openai_client, qa_item)

pprint(f"Question: {qa_item['question']}")
pprint(f"Expected: {qa_item['answer']}")
pprint(f"Full Context GPT: {answer}")
print(f"Tokens used: {token_usage['total_tokens']}")


('Question: What does the maintenance plan cover for the company in the fiscal '
 'year?')
('Expected: The maintenance plans cover annual inspections and the replacement '
 'of wear and tear parts, excluding tires and the battery.')
('Full Context GPT: The maintenance plan for Tesla vehicles covers annual '
 'inspections and the replacement of wear and tear parts, excluding tires and '
 'the battery. Additionally, customers have the option to purchase an extended '
 'service plan, which provides coverage for the repair or replacement of '
 'vehicle parts for an additional four years or up to an additional 50')
Tokens used: 56564


 ## RAG scenarios

 ### Vanilla RAG

 This RAG will be very simple.



 ![Vanilla RAG](../images/vanilla-rag-flow.png)

 The chunk embeddings are stored in the vector DB.



 The user query is passed to OpenAI to extract any available metadata, specifically a dictionary with `fiscal_year` and `ticker`. Only vectors that match that fiscal year and ticker are searched.



 The vector DB returns the top N vectors (currently N=10), which are then passed as context to OpenAI to generate the answer.
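
The filter-then-search step can be illustrated with a small in-memory sketch. This is not the `VectorStore` implementation: the function name, the brute-force NumPy search, and the toy vectors are assumptions made for illustration.

```python
import numpy as np


def filtered_cosine_search(query_vec, vectors, metadata, filters, top_k=10):
    """Return (index, cosine similarity) for the top_k vectors whose metadata match `filters`."""
    keep = [i for i, meta in enumerate(metadata)
            if all(meta.get(key) == value for key, value in filters.items())]
    if not keep:
        return []
    candidates = np.asarray(vectors, dtype=float)[keep]
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity; the small epsilon guards against zero-length vectors.
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query) + 1e-12
    )
    order = np.argsort(-sims)[:top_k]
    return [(keep[i], float(sims[i])) for i in order]


# Toy data: three 2-d "embeddings" with ticker/year metadata.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_metadata = [
    {"ticker": "AAPL", "fiscal_year": 2012},
    {"ticker": "TSLA", "fiscal_year": 2014},
    {"ticker": "TSLA", "fiscal_year": 2014},
]
hits = filtered_cosine_search(
    [1.0, 0.0], toy_vectors, toy_metadata, {"ticker": "TSLA", "fiscal_year": 2014}
)
```

The AAPL vector is the closest match to the query but is excluded by the filter, which is the point of restricting the search to the extracted ticker and fiscal year.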

 #### Instantiate the RAG pipeline

 The `RAGPipeline` object loads the data automatically; the examples above were for demonstration.

In [15]:
from sec_insights.rag.vector_store import VectorStore
from sec_insights.rag.embedding import EmbeddingManager

embedding_manager = EmbeddingManager()

# 0. Load chunks into vector DB with metadata and UUIDs
vs = VectorStore(use_docker=False, embedding_manager=embedding_manager)

# Prepare chunks with all metadata and IDs preserved
chunk_dicts = prepare_chunks_for_qa_generation(chunks)
embeddings_list = [chunk.embedding for chunk in chunks]

# Verify we have the right structure (metadata, id, text)
print(f"Sample chunk keys: {list(chunk_dicts[0].keys())}")
print(f"Sample chunk id: {chunk_dicts[0]['id']}")
print(f"Sample metadata: {chunk_dicts[0]['metadata']}")



Sample chunk keys: ['id', 'text', 'metadata', 'embedding']
Sample chunk id: 745fd8e0-6017-5d5c-9022-4708e7981365
Sample metadata: {'ticker': 'AAPL', 'fiscal_year': 2012, 'section': '1', 'section_num': '1', 'section_letter': '', 'section_desc': 'Business', 'human_readable_id': 'AAPL_2012_1_0', 'seq': 0, 'slice_idx': 0}


 Now we upload the chunks into the vector store, ask a question, and inspect the retrieved context.

In [16]:
# Upsert with embeddings, metadata, and UUIDs
vs.upsert_chunks(chunk_dicts, embeddings_list)
print(f"✅ Loaded {len(chunk_dicts)} chunks with metadata into vector DB")

# For answering questions, you need to use the search method and then generate answers
import json
import random

with open(qa_output_path, "r") as f:
    qa_set = [json.loads(line) for line in f]
qa_item = random.choice(qa_set)

# Use the search method to get relevant chunks
query_embedding = embedding_manager.embed_texts_in_batches([qa_item["question"]])[0]
search_results = vs.search(query_vector=query_embedding, top_k=10)

print(f"Question: {qa_item['question']}")
print(f"Expected: {qa_item['answer']}")
print(f"Retrieved {len(search_results)} chunks")
print(f"Top result score: {search_results[0]['score'] if search_results else 'No results'}")

# To get a generated answer, you'd need to use a full RAG pipeline or generate manually
# For now, let's just show the retrieved context
if search_results:
    context = search_results[0]['payload']['text']
    print(f"Top retrieved context: {context[:200]}...")




✅ Loaded 42480 chunks with metadata into vector DB
Question: What does the maintenance plan cover for the company in the fiscal year?
Expected: The maintenance plans cover annual inspections and the replacement of wear and tear parts, excluding tires and the battery.
Retrieved 10 chunks
Top result score: 0.6060109600118968
Top retrieved context: Maintenance and Service Plans We offer a prepaid maintenance program for our vehicles, which includes plans covering maintenance for up to four years or up to 50,000 miles, provided these services are...


 ### RAG with Re-Ranker

 ![Reranking Rag](../images/rag-rerank-flow.png)

 With the reranker, we retrieve the top 20 vectors by cosine similarity and let the reranker select the ten most relevant ones to send to the LLM.



 The `BAAI/bge-reranker-base` cross-encoder assigns each query-document pair a relevance logit. Unlike cosine similarity, which only measures the directional closeness of two independently computed embeddings, the reranker joins the query and document into a single input sequence delimited by special tokens ([CLS]/[SEP] in BERT-style models), attends across both texts to capture fine-grained semantic relations, and then ranks the documents by their logit scores.

In [19]:
from sec_insights.rag.reranker import BGEReranker
from sec_insights.rag.generation import AnswerGenerator

reranker = BGEReranker()
answer_generator = AnswerGenerator()
qa_item = random.choice(qa_set)

# Get 20 results, rerank to top 10, generate answer
query_embedding = embedding_manager.embed_texts_in_batches([qa_item["question"]])[0]
search_results = vs.search(query_vector=query_embedding, top_k=20)
texts = [r["payload"]["text"] for r in search_results]

reranked_tuples = reranker.rerank(qa_item["question"], texts, top_k=10)

reranked_indices = [idx for idx, score in reranked_tuples]
reranked_results = [search_results[i] for i in reranked_indices]

result = answer_generator.generate_answer(qa_item["question"], reranked_results)



BGEReranker using device: mps


In [20]:
pprint(f"Question: {qa_item['question']}")
pprint(f"Expected: {qa_item['answer']}")
pprint(f"Reranked RAG: {result['answer']}")


('Question: What does the maintenance plan cover for the company in the fiscal '
 'year?')
('Expected: The maintenance plans cover annual inspections and the replacement '
 'of wear and tear parts, excluding tires and the battery.')
('Reranked RAG: The maintenance plans for the company cover annual inspections '
 'and the replacement of wear and tear parts, excluding tires and the battery. '
 'These plans can be prepaid and typically cover maintenance for up to four '
 'years or 50,000 miles, depending on the specific service purchased. Payments '
 'collected in advance are recorded as deferred')


 ### Ensemble Reranked RAG

 ![Ensemble RAG](../images/rag-ensemble-flow.png)

 After expanding the input query with an OpenAI call, the pipeline retrieves the top 20 chunks by vector search on the original question and then applies two separate cross-encoder rerankers, `BAAI/bge-reranker-base` and `jinaai/jina-reranker-v2-base-multilingual`, to each (expanded query, document) pair. Each reranker outputs a relevance score from a joint encoding of the query and document. Those scores are min-max normalized independently, averaged to form a fused score, and used to pick the final top 10. Finally, the selected passages are fed into a generative reader (`AnswerGenerator`) alongside the original question to produce the answer.

In [22]:
# Ensemble Reranked RAG - simplified version
from sec_insights.rag.reranker import BGEReranker
from sec_insights.evaluation.scenarios_financerag import CrossEncoder
import random
import numpy as np
# Initialize models
print("🔄 Loading ensemble rerankers...")
bge_reranker = BGEReranker()
jina_reranker = CrossEncoder(
    "jinaai/jina-reranker-v2-base-multilingual", trust_remote_code=True
)
answer_generator = AnswerGenerator()


qa_item = random.choice(qa_set)

# expand the query
expanded_query_prompt = f"""
Expand this financial question with relevant financial keywords and context:
Question: {qa_item['question']}

Return just the expanded question, nothing else.
"""

response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": expanded_query_prompt}],
    max_tokens=100,
    temperature=0,
)
expanded_query = response.choices[0].message.content.strip()

# start retrieval
query_embedding = embedding_manager.embed_texts_in_batches([qa_item["question"]])[0]
search_results = vs.search(query_vector=query_embedding, top_k=20)
texts = [r["payload"]["text"] for r in search_results]

# ensemble reranking
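bge_tuples = bge_reranker.rerank(expanded_query, texts, top_k=20)
# rerank() returns (index, score) tuples in ranked order, so place each score
# back at its original position to stay aligned with jina_scores.
bge_scores = np.zeros(len(texts))
for idx, score in bge_tuples:
    bge_scores[idx] = score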

jina_scores = jina_reranker.predict([(expanded_query, text) for text in texts])

# normalize/fuse scores
bge_norm = (bge_scores - bge_scores.min()) / (
    bge_scores.max() - bge_scores.min() + 1e-6
)
jina_norm = (jina_scores - jina_scores.min()) / (
    jina_scores.max() - jina_scores.min() + 1e-6
)
fused_scores = (bge_norm + jina_norm) / 2

# get final results
final_indices = np.argsort(fused_scores)[::-1][:10]
final_results = [search_results[i] for i in final_indices]

# get the answer
result = answer_generator.generate_answer(qa_item["question"], final_results)

print(f"Original Query: {qa_item['question']}")
print(f"Expanded Query: {expanded_query}")
print(f"Expected: {qa_item['answer']}")
print(f"Ensemble RAG: {result['answer']}")


🔄 Loading ensemble rerankers...
BGEReranker using device: mps


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Original Query: What does the maintenance plan cover for the company in the fiscal year?
Expanded Query: What specific components and services are included in the maintenance plan for the company during the current fiscal year, and how do these elements impact the overall budget allocation, operational efficiency, and long-term asset management strategy? Additionally, what are the projected costs associated with the maintenance plan, and how do they align with the company's financial forecasts and performance metrics?
Expected: The maintenance plans cover annual inspections and the replacement of wear and tear parts, excluding tires and the battery.
Ensemble RAG: The maintenance plan for the company covers annual inspections and the replacement of wear and tear parts, excluding tires and the battery. Plans are available for up to eight years or 100,000 miles, depending on the specific service purchased within a designated timeframe. Additionally, there is an Extended Service plan that 
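
The score fusion step can be isolated into a small standalone sketch with toy numbers. The function names are hypothetical, and both score lists are assumed to be aligned to the same document order.

```python
import numpy as np


def min_max(scores):
    """Scale scores to [0, 1]; constant inputs map to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def fuse_and_rank(scores_a, scores_b, top_k=10):
    """Average two min-max normalized score lists and return the top_k indices."""
    fused = (min_max(scores_a) + min_max(scores_b)) / 2
    order = np.argsort(-fused, kind="stable")[:top_k]
    return order.tolist(), fused


# Toy scores for three documents from two rerankers on different scales.
order, fused = fuse_and_rank([4.0, 0.0, 2.0], [0.9, 0.1, 0.3], top_k=2)
```

Normalizing each reranker's scores separately keeps the model with the larger raw score range from dominating the average.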

 # Next step: Run and evaluate



 See the notebook "Evaluation.ipynb" for comparing all models.