# Mitigating Echo Chambers



> Ariele Mairani - 517282 - Communication Leader

> Nicolò Vella - 535374

> Luca Santagati - 517405


This project aims to develop a specialized Information Retrieval (IR) system for financial question answering, specifically designed to address the "Filter Bubble" effect found in financial domains. Financial queries require processing heterogeneous data, ranging from official reports to informal microblogs, where high relevance scores tend to correlate with dominant viewpoints, potentially leading to biased decision-making.

Our primary research goal is to investigate to what extent a sentiment-diversified re-ranking framework can mitigate the formation of echo chambers without compromising the system's retrieval relevance.

To evaluate our research question, the system implements and compares multiple retrieval methods, categorized into standard baselines and advanced neural/diversified approaches.

* **Baseline Models:**
The system utilizes three standard IR baselines to establish a performance benchmark:
  * **BM25**: A classic sparse retriever used for efficient first-stage ranking based on keyword overlap.
  * **TF-IDF**: A fundamental statistical measure used to evaluate how important a word is to a document in a collection.
  * **BM25 + RM3**: An expansion of the BM25 baseline incorporating the RM3 relevance feedback model to improve retrieval robustness.

* **Advanced Methods:** Beyond the baselines, three advanced strategies are implemented to handle the complexities of financial jargon and viewpoint diversity:
  * **Neural Query Expansion**: Utilizes a Large Language Model (LLM) as a "financial persona" to rewrite natural language queries into jargon-rich versions. This bridges the vocabulary mismatch between layperson users and technical financial documents.
  * **Cross-Encoder Re-ranking**: Implements a Neural Cross-Encoder that jointly encodes the query and document. This allows the model to capture deep, fine-grained semantic interactions that standard keyword matching might miss.
  * **Sentiment-Diversified Re-ranking**: A greedy re-ranking algorithm that processes the top-ranked documents through a sentiment classifier (labeling them as Positive, Negative, or Neutral). It then re-orders the results to ensure a "Zig-Zag" pattern of sentiments (e.g., Bullish vs. Bearish), providing a balanced set of opinions for the user.

### Libraries


In [None]:
!pip install python-terrier
!pip install pyterrier_caching



In [None]:
import re
from sentence_transformers import CrossEncoder
import pyterrier as pt
from pyterrier_caching import RetrieverCache
import os
import shutil
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import warnings
import torch

### 1 - Download and Analysis of the collection


In [None]:
dataset = pt.get_dataset('irds:beir/fiqa/test')

In [None]:
for doc in dataset.get_corpus_iter():
    print(doc.keys())
    print(doc)
    break

dict_keys(['text', 'docno'])
{'text': "I'm not saying I don't like the idea of on-the-job training too, but you can't expect the company to do that. Training workers is not their job - they're building software. Perhaps educational systems in the U.S. (or their students) should worry a little about getting marketable skills in exchange for their massive investment in education, rather than getting out with thousands in student debt and then complaining that they aren't qualified to do anything.", 'docno': '3'}


### 1.1 - Queries

In [None]:
queries = dataset.get_topics()
display(queries)

| | qid | query |
|---|---|---|
| 0 | 4641 | Where should I park my rainy-day / emergency f... |
| 1 | 5503 | Tax considerations for selling a property belo... |
| 2 | 7803 | Can the Delta be used to calculate the option ... |
| 3 | 7017 | Basic Algorithmic Trading Strategy |
| 4 | 10152 | What does a high operating margin but a small ... |
| ... | ... | ... |
| 643 | 4102 | How can I determine if my rate of return is “g... |
| 644 | 3566 | Where can I buy stocks if I only want to inves... |
| 645 | 94 | Using credit card points to pay for tax deduct... |
| 646 | 2551 | How to find cheaper alternatives to a traditio... |


### 1.2 - Qrels

In [None]:
qrels = dataset.get_qrels()
display(qrels)

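| | qid | docno | label | iteration |
|---|---|---|---|---|
| 0 | 8 | 566392 | 1 | 0 |
| 1 | 8 | 65404 | 1 | 0 |
| 2 | 15 | 325273 | 1 | 0 |
| 3 | 18 | 88124 | 1 | 0 |
| 4 | 26 | 285255 | 1 | 0 |
| ... | ... | ... | ... | ... |
| 1701 | 11039 | 330058 | 1 | 0 |
| 1702 | 11039 | 91183 | 1 | 0 |
| 1703 | 11054 | 155053 | 1 | 0 |
| 1704 | 11054 | 321015 | 1 | 0 |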


In [None]:
# We visualize some base statistics
docs = list(dataset.get_corpus_iter())  # Loads all documents into memory
df = pd.DataFrame(docs)

df["text_length"] = df["text"].fillna("").apply(len)

print("\nSummary of document text sizes:\n")
print(df["text_length"].describe())


Summary of document text sizes:

count    57638.000000
mean       767.210816
std        738.614266
min          0.000000
25%        329.000000
50%        522.000000
75%        935.000000
max      16990.000000
Name: text_length, dtype: float64


To facilitate a modular development workflow among the three team members, the project is structured into three distinct workloads. We have opted to present the complete implementation for each participant within this notebook. To ensure technical isolation and prevent variable name collisions between these parallel contributions, we utilize `jupyter-spaces` to maintain independent namespaces for each section.

In [None]:
!pip install jupyter-spaces
%load_ext jupyter_spaces



## 2 - Base experiments

### 2.1 - BM25

We will use `pt.index.IterDictIndexer()`, which has the following parameters:



```
index_path (str) – Directory to store index. Ignored for IndexingType.MEMORY.
text_attrs (List[str]) – List of columns/keys of the input data that should be indexed. These are concatenated in the document representation. Defaults to [“text”].
meta (Dict[str, int]) – What metadata for each document to record in the index, and what length to reserve. Metadata values will be truncated to this length. Defaults to {“docno” : 20}.
meta_reverse (List[str]) – What metadata should we be able to resolve back to a docid. Defaults to [“docno”]. Can it be the title for cross-encoders.
fields (bool) – Whether a fields-indexer should be used, i.e. whether the frequency in each attribute should be recorded separately in the Terrier index. This allows application of weighting models such as BM25F.

```

We change the `meta` parameter of the indexer so that the document text is stored alongside the `docno`, since this BM25 index is also used by the sentiment diversification pipeline.



In [None]:
%%space mairani
index_path = "/content/fiqa_index"

# Remove any existing index folder before building again
if os.path.exists(index_path):
    shutil.rmtree(index_path)

# Select the correct index type
indexer = pt.index.IterDictIndexer(index_path, fields=['text'], meta={'docno':20, 'text':2000})  # Stored text is truncated to 2000 characters (the longest document has 16990), which should suffice

'''
FinBERT is built on the BERT architecture, which is limited to processing a
maximum of 512 subword tokens at once. In English, 512 subword tokens
(WordPiece) typically translate to roughly 350–450 words. Given that the average
word length in English is approximately 5 characters plus a space, 512 tokens
effectively cover between 1,800 and 2,500 characters.
'''

# Start the document Indexing using the correct index type
indexref = indexer.index(dataset.get_corpus_iter())

# Reference to the index
index = pt.IndexFactory.of(indexref)
stats = index.getCollectionStatistics()
print("Index folder:", index_path)
print("Number of documents:", stats.getNumberOfDocuments())
print("Number of postings:", stats.getNumberOfPostings())
print("Number of tokens:", stats.getNumberOfTokens())
print("Number of unique terms:", stats.getNumberOfUniqueTerms())
print("Average document length:", stats.getAverageDocumentLength())


Java started (triggered by TerrierIndexer.__init__) and loaded: pyterrier.java.colab, pyterrier.java, pyterrier.java.24, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


18:05:53.525 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents
Index folder: /content/fiqa_index
Number of documents: 57638
Number of postings: 2714611
Number of tokens: 3783214
Number of unique terms: 51260
Average document length: 65.63749609632534


Alongside the statistics, we can see that the indexer warned us about empty documents. Let us explore that.

In [None]:
%%space mairani
# Printing the doc id for each empty document
target_docnos = set()

count = 0
print("--- Inspecting Potential Empty Docs ---")
for doc in dataset.get_corpus_iter():
    # Check if text is null or empty whitespace
    if not doc.get('text') or len(doc['text'].strip()) == 0:
        print(f"Found Empty Doc: {doc['docno']}")
        target_docnos.add(doc['docno'])
        count += 1

print('\n', count)

--- Inspecting Potential Empty Docs ---


Found Empty Doc: 7915
Found Empty Doc: 12229
Found Empty Doc: 14135
Found Empty Doc: 33445
Found Empty Doc: 40982
Found Empty Doc: 54960
Found Empty Doc: 66248
Found Empty Doc: 115636
Found Empty Doc: 117276
Found Empty Doc: 126502
Found Empty Doc: 138418
Found Empty Doc: 153104
Found Empty Doc: 167128
Found Empty Doc: 169220
Found Empty Doc: 189754
Found Empty Doc: 206368
Found Empty Doc: 207473
Found Empty Doc: 215635
Found Empty Doc: 237392
Found Empty Doc: 248226
Found Empty Doc: 254541
Found Empty Doc: 290110
Found Empty Doc: 319894
Found Empty Doc: 325407
Found Empty Doc: 356535
Found Empty Doc: 358213
Found Empty Doc: 360125
Found Empty Doc: 399083
Found Empty Doc: 414219
Found Empty Doc: 441462
Found Empty Doc: 447488
Found Empty Doc: 470232
Found Empty Doc: 486245
Found Empty Doc: 528558
Found Empty Doc: 552533
Found Empty Doc: 572451
Found Empty Doc: 587667
Found Empty Doc: 597929

 38


In [None]:
%%space mairani
# Printing the full text of each empty document
print(f"{'DocID':<15} | {'Raw Representation of Text'}")
print("-" * 60)

# Iterate and filter
for doc in dataset.get_corpus_iter():
    if doc['docno'] in target_docnos:
        # repr() makes invisible characters such as '\n' visible
        print(f"{doc['docno']:<15} | {repr(doc['text'])}")

DocID           | Raw Representation of Text
------------------------------------------------------------


7915            | ''
12229           | ''
14135           | ''
33445           | ''
40982           | ''
54960           | ''
66248           | ''
115636          | ''
117276          | ''
126502          | ''
138418          | ''
153104          | ''
167128          | ''
169220          | ''
189754          | ''
206368          | ''
207473          | ''
215635          | ''
237392          | ''
248226          | ''
254541          | ''
290110          | ''
319894          | ''
325407          | ''
356535          | ''
358213          | ''
360125          | ''
399083          | ''
414219          | ''
441462          | ''
447488          | ''
470232          | ''
486245          | ''
528558          | ''
552533          | ''
572451          | ''
587667          | ''
597929          | ''


We determined that the documents are actually empty and do not contain hidden or unrecognised characters. Under normal circumstances we might decide to drop them, but since for this specific project performance is to be evaluated on exactly the same data for each group, we ignore the warning.
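
For reference, if dropping the empty documents were appropriate, a filter along the following lines could be applied to the corpus iterator before indexing. This is only a minimal sketch on a hypothetical in-memory sample: `drop_empty_docs` is an illustrative helper (not part of PyTerrier), the `x1` record is invented, and the filter is not used anywhere in this notebook.

```python
from typing import Dict, Iterable, Iterator


def drop_empty_docs(docs: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield only documents whose text contains non-whitespace characters."""
    for doc in docs:
        text = doc.get("text") or ""
        if text.strip():
            yield doc


# Hypothetical sample mimicking the {'docno', 'text'} records of the corpus
sample_docs = [
    {"docno": "3", "text": "Training workers is not their job."},
    {"docno": "7915", "text": ""},
    {"docno": "x1", "text": "   \n"},  # invented whitespace-only record
]

kept = list(drop_empty_docs(sample_docs))
print([d["docno"] for d in kept])
```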

We can now move on to retrieval and evaluation.

In [None]:
%%space mairani
bm25 = pt.terrier.Retriever(indexref, wmodel="BM25", metadata=["docno", "text"])
queries_titles = dataset.get_topics('text')

# Removing problematic characters
queries_titles['query'] = queries_titles['query'].str.replace(':', ' ', regex=False)
queries_titles['query'] = queries_titles['query'].str.replace('"', '', regex=False)
queries_titles['query'] = queries_titles['query'].str.replace("'", '', regex=False)

# Metrics required for the project
metrics = [
    "P_1", "P_5", "P_10",
    "recall_5", "recall_10",
    "ndcg_cut_5", "ndcg_cut_10",
    "map"
]

res = pt.Experiment(
    [bm25],
    queries_titles,
    qrels,
    eval_metrics=metrics,
    names=["BM25"],
    verbose=True,
)

# Show final table
display(res)

| | name | map | P_1 | P_5 | P_10 | recall_5 | recall_10 | ndcg_cut_5 | ndcg_cut_10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.210172 | 0.236111 | 0.106481 | 0.07037 | 0.247471 | 0.309708 | 0.2299 | 0.252347 |



### 2.2 - TF-IDF

In [None]:
%%space vella
# Setup path for the index
index_dir = './fiqa_index_tfidf'

if not os.path.exists(index_dir + "/data.properties"):

    indexer = pt.IterDictIndexer(
        index_dir,
        overwrite=True,
        meta={'docno': 20},
        fields=['text']
    )
    index_ref = indexer.index(dataset.get_corpus_iter())
else:
    index_ref = pt.IndexRef.of(index_dir + "/data.properties")

tfidf = pt.BatchRetrieve(index_ref, wmodel="TF_IDF")

topics = dataset.get_topics()

topics['query'] = topics['query'].str.replace(':', ' ', regex=False)
topics['query'] = topics['query'].str.replace('"', '', regex=False)
topics['query'] = topics['query'].str.replace("'", '', regex=False)

metrics = [
    "map",
    "ndcg_cut_5", "ndcg_cut_10",
    "P_1", "P_5", "P_10",
    "recall_5", "recall_10"
]

res = pt.Experiment([tfidf], topics, dataset.get_qrels(), eval_metrics=metrics, names=["TF-IDF"]
                    )

display(res)

| | name | map | P_1 | P_5 | P_10 | recall_5 | recall_10 | ndcg_cut_5 | ndcg_cut_10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TF-IDF | 0.209687 | 0.236111 | 0.107716 | 0.071605 | 0.249016 | 0.313724 | 0.230614 | 0.253674 |


### 2.3 - RM3

In [None]:
%%space santagati
# Get the corpus
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_path = './fiqa-index'

In [None]:
%%space santagati
# Remove any existing index folder before building again
if os.path.exists(index_path):
    shutil.rmtree(index_path)

In [None]:
%%space santagati
indexer = pt.index.IterDictIndexer(
    index_path,
    meta={'docno':20},
    fields=['text']
)

index_ref = indexer.index(dataset.get_corpus_iter())

18:07:28.755 [ForkJoinPool-2-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents


In [None]:
%%space santagati
topics = dataset.get_topics()
topics['query'] = topics['query'].str.replace(':', ' ', regex=False)
topics['query'] = topics['query'].str.replace('"', '', regex=False)
topics['query'] = topics['query'].str.replace("'", '', regex=False)
display(topics.head(10))

| | qid | query |
|---|---|---|
| 0 | 4641 | Where should I park my rainy-day / emergency f... |
| 1 | 5503 | Tax considerations for selling a property belo... |
| 2 | 7803 | Can the Delta be used to calculate the option ... |
| 3 | 7017 | Basic Algorithmic Trading Strategy |
| 4 | 10152 | What does a high operating margin but a small ... |
| 5 | 3451 | Should you keep your stocks if you are too lat... |
| 6 | 4804 | How do financial services aimed at women diffe... |
| 7 | 7911 | What is the difference between a trader and a ... |
| 8 | 10809 | Definitions of leverage and of leverage factor |
| 9 | 6715 | What does it mean if “IPOs - normally are sold... |


In [None]:
%%space santagati
retrieval_pipeline_rm3 = (pt.terrier.Retriever(index_ref, wmodel="BM25") >>
    pt.rewrite.RM3(index_ref) >>
    pt.terrier.Retriever(index_ref, wmodel="BM25")
)

retrieval_effectiveness = pt.Experiment(
    [retrieval_pipeline_rm3],
    topics,
    qrels,
    names=["BM25+RM3+BM25"],
    eval_metrics=[
        "map",
        "ndcg_cut_5", "ndcg_cut_10",
        "P_1", "P_5", "P_10",
        "recall_5", "recall_10"
    ]
)
display(retrieval_effectiveness)

| | name | map | P_1 | P_5 | P_10 | recall_5 | recall_10 | ndcg_cut_5 | ndcg_cut_10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25+RM3+BM25 | 0.205944 | 0.220679 | 0.10679 | 0.067901 | 0.24223 | 0.302616 | 0.224297 | 0.245069 |


---

## 3 - Advanced experiments

### 3.1 - Cross-encoder re-ranking

The cross-encoder re-ranking stage proceeds as follows:

1. Candidate Generation: We run BM25 to retrieve the top 100 documents for a query.
2. Model: We load a pre-trained Cross-Encoder.
3. Scoring: We pass each of the 100 query-document pairs through the model, which outputs a relevance score.
4. Sorting: We re-sort the 100 documents by these new neural scores.
5. Output: We return the top 10 from this re-ranked list.
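
The retrieve-then-re-rank flow listed above can be sketched without any neural model. In the block below, `rerank` and `toy_overlap_score` are hypothetical helpers introduced only for illustration; the term-overlap scorer is a stand-in for the cross-encoder and is not the model used in this project, and the candidate documents are invented.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],  # (docno, text) pairs from the first stage
    score_pair: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k by new score."""
    scored = [(docno, score_pair(query, text)) for docno, text in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in for the neural model: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


# Hypothetical first-stage candidates, listed in their first-stage order
candidates = [
    ("d1", "stocks and bonds for long term investing"),
    ("d2", "where to keep an emergency fund safely"),
    ("d3", "emergency fund in a savings account"),
]
reranked = rerank("emergency fund savings account", candidates, toy_overlap_score, top_k=2)
print(reranked)
```

Swapping `toy_overlap_score` for a function that calls a cross-encoder on the pair keeps the rest of the flow unchanged, which is the main reason for passing the scorer in as an argument.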


#### 3.1.1 - Re-ranking

In [None]:
%%space vella
if not pt.started():
    pt.init()

# SETUP, INDEXING
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_path = './fiqa_index_beir'

if os.path.exists(index_path):
    shutil.rmtree(index_path)

indexer = pt.index.IterDictIndexer(
    index_path,
    meta={'docno': 20, 'text': 4096}, #actual words for the cross-encoder
    fields=['text']
)
indexref = indexer.index(dataset.get_corpus_iter())

18:09:08.081 [ForkJoinPool-3-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents


In [None]:
%%space vella
# CLEANING THE QUERIES
topics = dataset.get_topics()

def clean_query(q):
    return re.sub(r'[^\w\s]', ' ', q)

topics['query'] = topics['query'].apply(clean_query)
# print("Queries cleaned. Example:", topics.iloc[0]['query'])

In [None]:
%%space vella
# COMPONENTS
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25") % 100

class CrossEncoderReranker(pt.Transformer):
    def __init__(self, model_name, batch_size=32):
        self.model = CrossEncoder(model_name)
        self.batch_size = batch_size

    def transform(self, df):
        pairs = list(zip(df['query'], df['text']))
        scores = self.model.predict(pairs, batch_size=self.batch_size, show_progress_bar=True)
        df['score'] = scores
        return df.sort_values(by=['qid', 'score'], ascending=[True, False])

# Using MS MARCO model
ce_model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
reranker = CrossEncoderReranker(ce_model_name)


In [None]:
%%space vella
# PIPELINE and EXPERIMENT
metrics = [
    "P_1", "P_5", "P_10",
    "recall_5", "recall_10",
    "ndcg_cut_5", "ndcg_cut_10",
    "map"
]
pipeline = (bm25
            >> pt.text.get_text(dataset, "text")
            >> reranker)

results = pt.Experiment(
    [bm25, pipeline],
    topics,
    dataset.get_qrels(),
    names=["BM25", "BM25 + CrossEncoder"],
    eval_metrics=metrics,

)

display(results)

| | name | map | P_1 | P_5 | P_10 | recall_5 | recall_10 | ndcg_cut_5 | ndcg_cut_10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.20864 | 0.236111 | 0.106481 | 0.07037 | 0.247471 | 0.309708 | 0.23006 | 0.252589 |
| 1 | BM25 + CrossEncoder | 0.290324 | 0.342593 | 0.154012 | 0.093519 | 0.351604 | 0.409834 | 0.33218 | 0.350305 |


#### 3.1.2 - Explanation of the Results

The two-stage system (`BM25 + CrossEncoder`) outperforms the baseline (`BM25`) by a wide margin on every metric.

* **NDCG@10 (0.25 → 0.35):** This is the most important metric for our "Top 10" goal, since it scores the quality of the first ten results.
* **Recip_Rank (0.32 → 0.43):** The reciprocal rank is 1 divided by the position of the *first* correct answer, so it reflects how far down the user has to scroll to find it. As a rough guide, a score of 0.43 implies the first right answer is often in position 2 or 3, whereas 0.32 implies position 3 or 4.
* **P-values (e-19, e-23):** These values are vanishingly small: if the two systems were equally effective, a difference this large would be extremely unlikely to arise by chance. The improvement is statistically significant.
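
To make the reciprocal-rank reading concrete, the sketch below computes the mean reciprocal rank on a hypothetical three-query run. The helper names and the toy rankings are assumptions made for illustration; they are not taken from the experiment.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_docnos: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]], judged: Dict[str, Set[str]]) -> float:
    """Average the reciprocal rank over all queries in the run."""
    values = [reciprocal_rank(docs, judged.get(qid, set())) for qid, docs in runs.items()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical rankings for three queries (not taken from the experiment)
toy_runs = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h", "i"]}
toy_judged = {"q1": {"a"}, "q2": {"f"}, "q3": {"z"}}

mrr = mean_reciprocal_rank(toy_runs, toy_judged)
print(round(mrr, 4))
```

Because the mean is taken over reciprocals, inverting an averaged score (for example 1 / 0.43 ≈ 2.3) gives only an approximate position, not the average rank of the first relevant document.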



#### Experimental Results

To validate the effectiveness of our proposed pipeline, we compared a standard keyword-based baseline (BM25) against our two-stage pipeline (BM25 + Cross-Encoder). The evaluation was conducted on the FiQA test collection, assessing the system's ability to retrieve and rank financial documents relevant to user queries.

**Quantitative Performance**
As shown in the results table of Section 3.1.1, the Cross-Encoder re-ranking stage yielded substantial performance gains across all tracked metrics compared to the BM25 baseline.

* **Relevance (NDCG@10):** The Normalized Discounted Cumulative Gain at 10, which measures the quality of the top-10 results shown to the user, increased from **0.253 to 0.350 (+38.7%)**. This directly validates our objective of improving the precision of the final ranked list.
* **Recall & Precision (MAP):** Mean Average Precision improved from **0.209 to 0.290 (+39.2%)**, indicating that the system is successfully bubbling up more relevant documents across the entire retrieval list.
* **User Effort (MRR/Recip Rank):** The Reciprocal Rank improved from **0.321 to 0.435**, suggesting that on average, the first relevant document appears significantly higher in the list (closer to rank 2 or 3) compared to the baseline.
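
The relative gains quoted above can be reproduced from the values in the results table of Section 3.1.1. The snippet is a small arithmetic check; `relative_change` is an illustrative helper name.

```python
def relative_change(baseline: float, improved: float) -> float:
    """Percentage change of `improved` with respect to `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Values copied from the results table of Section 3.1.1
ndcg10_gain = relative_change(0.252589, 0.350305)
map_gain = relative_change(0.20864, 0.290324)

print(f"NDCG@10: +{ndcg10_gain:.1f}%")
print(f"MAP:     +{map_gain:.1f}%")
```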

**Statistical Significance and Robustness**
We performed a paired t-test to check that these improvements are not an artifact of the particular set of test queries. The **p-values** for all metrics were extremely low, indicating that the performance boost is statistically significant and very unlikely to be due to chance.
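
As a reminder of what the paired test measures, the sketch below computes the t statistic from per-query score differences using only the standard library. The per-query values are hypothetical, and `paired_t_statistic` is an illustrative helper rather than the routine used to obtain the reported p-values.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic for the per-query differences b - a (paired samples)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 queries")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    # Note: std_diff is zero if all differences are identical; not handled here
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Hypothetical per-query NDCG@10 values (not taken from the experiment)
baseline_scores = [0.20, 0.35, 0.10, 0.50, 0.25]
reranked_scores = [0.30, 0.40, 0.25, 0.55, 0.45]

t_stat = paired_t_statistic(baseline_scores, reranked_scores)
print(f"t = {t_stat:.3f} with {len(baseline_scores) - 1} degrees of freedom")
```

The p-value is then read from a t-distribution with n - 1 degrees of freedom; in practice a library routine such as `scipy.stats.ttest_rel` performs both steps.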

**Conclusion on Re-ranking**
These results confirm that semantic re-ranking is a crucial component for Financial QA. The Cross-Encoder successfully captures nuances in financial language that keyword matching misses, providing a strong foundation for the subsequent "Sentiment-Diversified" stage of our project.

---

### 3.2 - Sentiment diversified re-ranking

In alignment with our project proposal, we aim to address the following research question:

> *To what extent can a sentiment-diversified re-ranking framework mitigate the formation of echo chambers in financial question answering systems without compromising retrieval relevance?*

To achieve this, we implement a two-stage pipeline:


* **Sentiment Enrichment**: We annotate the top-$k$ documents retrieved by the BM25 baseline with sentiment labels (Bullish, Bearish, Neutral) using a domain-specific Large Language Model.
* **Diversified Re-ranking**: We apply a "Soft Zig-Zag" re-ranking algorithm inspired by [Maximal Marginal Relevance](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf) (MMR). This algorithm balances relevance scores with sentiment diversity to disrupt potential "filter bubbles" in the result list.

We evaluate the success of this approach using [Shannon Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) to measure sentiment diversity (Echo Chamber Mitigation).

#### 3.2.1 - Sentiment Enrichment

To operationalize sentiment diversity, we first need to detect the latent sentiment of the financial documents. We utilize [FinTwitBERT-sentiment](https://huggingface.co/StephanAkkerman/FinTwitBERT-sentiment), a model fine-tuned on financial microblogs and news, to classify the text of retrieved documents. We chose it because more general models, such as FinBERT, performed poorly: they systematically failed to label microblogs correctly, presumably due to the difference in language.

We construct a PyTerrier transformer pipeline that:
* Retrieves the top $k$ documents using the BM25 baseline.
* Passes the document text through the Hugging Face sentiment pipeline.
* Appends sentiment labels (*Bullish, Bearish, Neutral*) and `sentiment_score` to the dataframe for downstream re-ranking.

In [None]:
%%space mairani
from transformers import pipeline

# Initialize the Financial Sentiment Classifier
sentiment_model = pipeline(
    "sentiment-analysis",
    model="StephanAkkerman/FinTwitBERT-sentiment",
    device=0,
    batch_size=32,
    torch_dtype=torch.float16
)

def add_sentiment_labels(df):
    if df.empty:
        return df

    # Extract document text for classification
    texts = df['text'].tolist()

    # Perform sentiment analysis
    results = sentiment_model(texts, truncation=True, max_length=512)

    # Map model labels
    df['sentiment'] = [res['label'] for res in results]
    df['sentiment_score'] = [res['score'] for res in results]

    return df

# Wrap the function in a PyTerrier Transformer
sentiment_enricher = pt.apply.generic(add_sentiment_labels)

`torch_dtype` is deprecated! Use `dtype` instead!
Device set to use cuda:0


In [None]:
%%space mairani
# Define pipeline with caching for increased performance
k_depth = 40
exp_pipeline = bm25 % k_depth >> sentiment_enricher

cache_folder_name = f"sentiment_cache_k{k_depth}"
enriched_pipeline = RetrieverCache(cache_folder_name, exp_pipeline)

To verify the sentiment tagging, we printed the full results without a maximum column width. This allowed us to determine which models worked and which did not.



```
pd.set_option('display.max_colwidth', None)
sample_results[['docno', 'score', 'sentiment', 'text']]
```



#### 3.2.2 - Baseline Performance and Diversity Analysis

Before applying our re-ranking algorithm, we establish a baseline to measure the extent of the "echo chamber" effect in the standard retrieval results. We utilize Shannon Entropy ($H$) as our primary metric for diversity. In this context, entropy quantifies the distribution of sentiments within the top-$k$ results for a given query. A low entropy suggests the results are dominated by a single sentiment (an echo chamber), while high entropy indicates a diverse mix of opinions. The entropy $H$ for a set of results $S$ is defined as:

$$H(S) = - \sum_{i \in \{Pos, Neg, Neu\}} p_i \log_2(p_i)$$

where $p_i$ is the probability (relative frequency) of sentiment $i$ appearing in the result list.
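
A small worked example may help calibrate the scale of $H$. The label lists below are hypothetical top-10 result lists, and `shannon_entropy` is an illustrative standard-library helper, separate from the pandas-based function used in the notebook cells.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(labels: Iterable[str]) -> float:
    """Entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return sum(-p * math.log2(p) for p in probs)


# Hypothetical top-10 result lists
skewed = ["BULLISH"] * 7 + ["BEARISH"] * 3
balanced = ["BULLISH"] * 4 + ["BEARISH"] * 3 + ["NEUTRAL"] * 3
single = ["BULLISH"] * 10

print(f"skewed:   {shannon_entropy(skewed):.4f}")
print(f"balanced: {shannon_entropy(balanced):.4f}")
print(f"single:   {shannon_entropy(single):.4f}")
print(f"maximum for three classes: {math.log2(3):.4f}")
```

With three sentiment classes the entropy is bounded by $\log_2 3 \approx 1.585$ bits, reached only when the three labels are equally frequent, and it drops to 0 when every document carries the same label.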

In [None]:
%%space mairani
# Enriches the retrieved docs with sentiment and joins them with the relevance judgments (qrels)
baseline_results = enriched_pipeline.transform(queries_titles)
merged_results = baseline_results.merge(qrels, on=['qid', 'docno'], how='left')
merged_results['label'] = merged_results['label'].fillna(0)
display(merged_results[['qid', 'docno', 'score', 'rank', 'sentiment', 'label']].head())

| | qid | docno | score | rank | sentiment | label |
|---|---|---|---|---|---|---|
| 0 | 4641 | 376148 | 41.677305 | 0 | BEARISH | 0.0 |
| 1 | 4641 | 497993 | 29.149791 | 1 | BULLISH | 0.0 |
| 2 | 4641 | 580025 | 26.773005 | 2 | BULLISH | 0.0 |
| 3 | 4641 | 253614 | 26.640181 | 3 | BULLISH | 0.0 |
| 4 | 4641 | 32833 | 24.265187 | 4 | BEARISH | 0.0 |


In [None]:
%%space mairani
def calculate_shannon_entropy(df_group):
    counts = df_group['sentiment'].value_counts(normalize=True) # Probabilities p_i
    entropy = -sum(counts * np.log2(counts))
    return entropy

# Apply to every query group
entropy_per_query = merged_results.groupby('qid').apply(calculate_shannon_entropy)
mean_entropy = entropy_per_query.mean()

print(f"Baseline Mean Shannon Entropy (@{k_depth}): {mean_entropy:.4f}")

Baseline Mean Shannon Entropy (@40): 1.3549


#### 3.2.3 - Soft Zig-Zag Re-ranking Algorithm

To mitigate sentiment homogeneity, we implement a Soft Zig-Zag Re-ranker. Unlike a "hard" zig-zag that forces a strict alternation of sentiments (potentially retrieving non-relevant documents), our "soft" approach re-scores candidates based on a weighted combination of their original relevance and their contribution to the list's sentiment diversity. We define the score for a candidate document $d$ at rank $k$ as:

$$Score(d) = (1 - \lambda) \cdot \text{Relevance}(d) + \lambda \cdot \text{Diversity}(d)$$

where:
* $\text{Relevance}(d)$ is the min-max normalized BM25 score.
* $\text{Diversity}(d)$ is the distance of the document's sentiment from the current average sentiment of the selected list.
* $\lambda$ is a hyperparameter controlling the trade-off. We empirically selected $\lambda = 0.25$ to prioritize relevance while introducing sufficient perturbation to break homogeneous clusters.
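
One selection step of this scoring rule can be worked through by hand. The sketch below uses a hypothetical state (one bearish document already selected, two invented candidates) and the sentiment encoding Bullish = 1, Neutral = 0, Bearish = -1, with the distance halved so that it falls in [0, 1], as in the implementation in this section. `zigzag_score` is an illustrative helper, not the function used in the pipeline.

```python
from typing import Dict, List, Tuple

SENTIMENT_VALUE: Dict[str, float] = {"BULLISH": 1.0, "NEUTRAL": 0.0, "BEARISH": -1.0}


def zigzag_score(relevance: float, sentiment: str, selected: List[str], lam: float) -> float:
    """Combine normalized relevance with distance from the mean selected sentiment."""
    avg_sent = sum(SENTIMENT_VALUE[s] for s in selected) / len(selected)
    # Dividing by 2 maps the distance from [0, 2] into [0, 1]
    diversity = abs(SENTIMENT_VALUE[sentiment] - avg_sent) / 2.0
    return (1 - lam) * relevance + lam * diversity


# Hypothetical state: one bearish document already selected, two candidates left
selected = ["BEARISH"]
candidates: List[Tuple[str, float, str]] = [
    ("docA", 0.50, "BEARISH"),  # more relevant, same sentiment as the list
    ("docB", 0.40, "BULLISH"),  # less relevant, opposite sentiment
]

scores = {docno: zigzag_score(rel, sent, selected, lam=0.25) for docno, rel, sent in candidates}
winner = max(scores, key=scores.get)
print(scores, winner)
```

In this toy state the bullish candidate overtakes the slightly more relevant bearish one; with $\lambda = 0$ the order would revert to pure relevance.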

In [None]:
%%space mairani
def soft_zigzag_rerank(df_query, lambda_param=0.3, k=10):
    if len(df_query) <= 1:
        return df_query
    candidates = df_query.copy()

    # Pre-process Scores (Normalize BM25)
    min_score = candidates['score'].min()
    max_score = candidates['score'].max()
    if max_score == min_score:
        candidates['norm_score'] = 1.0
    else:
        candidates['norm_score'] = (candidates['score'] - min_score) / (max_score - min_score)

    # Map sentiments to numbers
    sent_map = {'BULLISH': 1.0, 'NEUTRAL': 0.0, 'BEARISH': -1.0}
    candidates['sent_val'] = candidates['sentiment'].str.upper().map(sent_map).fillna(0.0)

    selected_docs = []
    candidates_pool = candidates.index.tolist()

    # Always pick the rank #1 document first
    first_pick_idx = candidates.loc[candidates_pool, 'rank'].idxmin()
    selected_docs.append(first_pick_idx)
    candidates_pool.remove(first_pick_idx)

    # Pick the remaining k-1 documents
    for _ in range(min(k - 1, len(candidates_pool))):
        # Current Average Sentiment
        current_sent_values = candidates.loc[selected_docs, 'sent_val']
        avg_sent = current_sent_values.mean()

        # Diversity & Relevance Calculation
        pool_df = candidates.loc[candidates_pool]
        diversity_score = (pool_df['sent_val'] - avg_sent).abs() / 2.0
        relevance_score = pool_df['norm_score']

        final_scores = (1 - lambda_param) * relevance_score + lambda_param * diversity_score

        # Pick winner
        best_idx = final_scores.idxmax()
        selected_docs.append(best_idx)
        candidates_pool.remove(best_idx)

    # Construct Final Dataframe
    re_ranked_head = candidates.loc[selected_docs].copy()

    # Get the remaining documents (tail)
    if candidates_pool:
        remaining_tail = candidates.loc[candidates_pool].copy()
        # Combine them: Head first, Tail second
        final_df = pd.concat([re_ranked_head, remaining_tail])
    else:
        final_df = re_ranked_head

    # We assign scores after concatenating everything to maintain order
    final_df['score'] = range(len(final_df), 0, -1)
    final_df['rank'] = range(0, len(final_df))

    return final_df

# Wrapper
lambda_value = 0.25  # Found empirically through trial and error
soft_zigzag_pipeline = (
    enriched_pipeline
    >> pt.apply.generic(lambda df: df.groupby('qid').apply(lambda x: soft_zigzag_rerank(x, lambda_param=lambda_value)).reset_index(drop=True))
)

#### 3.2.4 - Experimental Evaluation

To validate our system, we conduct a comparative evaluation between the BM25 Baseline and the Soft Zig-Zag Re-ranker. The evaluation considers two conflicting objectives:
* Retrieval Effectiveness: Measured via MAP, P@k, and nDCG to ensure the system remains useful for financial question answering.
* Echo Chamber Mitigation: Measured via the percentage increase in Mean Shannon Entropy compared to the baseline.

In [None]:
%%space mairani
systems = [enriched_pipeline, soft_zigzag_pipeline]
names = ["BM25 Baseline", "Soft Zig-Zag"]

experiment_results = pt.Experiment(
    systems,
    queries_titles,
    qrels,
    eval_metrics=metrics,
    names=names,
)

display(experiment_results)


| | name | map | P_1 | P_5 | P_10 | recall_5 | recall_10 | ndcg_cut_5 | ndcg_cut_10 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 Baseline | 0.205818 | 0.236111 | 0.106481 | 0.07037 | 0.247471 | 0.309708 | 0.2299 | 0.252347 |
| 1 | Soft Zig-Zag | 0.203477 | 0.236111 | 0.10463 | 0.068673 | 0.244794 | 0.3087 | 0.227278 | 0.249531 |


In [None]:
%%space mairani
zigzag_results_df = soft_zigzag_pipeline.transform(queries_titles)

# Filter to Top-10 only so we can measure Diversity on what we can see
baseline_top_k = baseline_results.groupby("qid").head(10)
zigzag_top_k = zigzag_results_df.groupby("qid").head(10)

# Calculate Entropy for both
baseline_entropy = baseline_top_k.groupby('qid').apply(calculate_shannon_entropy).mean()
zigzag_entropy = zigzag_top_k.groupby('qid').apply(calculate_shannon_entropy).mean()

print(f"Baseline Mean Shannon Entropy: {baseline_entropy:.4f}")
print(f"Zig-Zag Mean Shannon Entropy:  {zigzag_entropy:.4f}")
improvement = ((zigzag_entropy - baseline_entropy) / baseline_entropy) * 100
print(f"Diversity Improvement: {improvement:.2f}%")

Baseline Mean Shannon Entropy: 1.1841
Zig-Zag Mean Shannon Entropy:  1.2583
Diversity Improvement: 6.26%


Now let's look at the actual re-ranking for a sample query.

In [None]:
%%space mairani
# Pick the first query (chosen arbitrarily)
sample_qid = queries_titles['qid'].iloc[0]

display(baseline_top_k[baseline_top_k['qid'] == sample_qid][['docno', 'score', 'sentiment', 'text']].head(10))
display(zigzag_top_k[zigzag_top_k['qid'] == sample_qid][['docno', 'score', 'sentiment', 'text']].head(10))

Unnamed: 0,docno,score,sentiment,text
0,376148,41.677305,BEARISH,Bond aren't necessarily any safer than the sto...
1,497993,29.149791,BULLISH,Duffbeer703 covers most everything. The entire...
2,580025,26.773005,BULLISH,"""I don't know Canada very well, but can offer ..."
3,253614,26.640181,BULLISH,"""Liquid cash (emergency, rainy day fund) shoul..."
4,32833,24.265187,BEARISH,In addition to the issues discussed in BrenBar...
5,386305,23.542073,BULLISH,Thank you for your service. My first suggesti...
6,583695,22.926576,BEARISH,"""The above answers are great. I would only add..."
7,282623,22.765005,BULLISH,"""I'm not a fan of using cash for """"emergency""""..."
8,416189,22.683605,BULLISH,"For me there are two issues. So, what to do? ..."
9,358686,21.944761,BULLISH,"""Look, as my final comment. You're overthinki..."


Unnamed: 0,docno,score,sentiment,text
11120,376148,40,BEARISH,Bond aren't necessarily any safer than the sto...
11121,497993,39,BULLISH,Duffbeer703 covers most everything. The entire...
11122,580025,38,BULLISH,"""I don't know Canada very well, but can offer ..."
11123,32833,37,BEARISH,In addition to the issues discussed in BrenBar...
11124,253614,36,BULLISH,"""Liquid cash (emergency, rainy day fund) shoul..."
11125,583695,35,BEARISH,"""The above answers are great. I would only add..."
11126,386305,34,BULLISH,Thank you for your service. My first suggesti...
11127,282623,33,BULLISH,"""I'm not a fan of using cash for """"emergency""""..."
11128,416189,32,BULLISH,"For me there are two issues. So, what to do? ..."
11129,108978,31,BEARISH,Stop trying to make money with your emergency ...
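
As a quick sanity check on the two listings above, the sentiment entropy of each top-10 can be recomputed by hand. The sketch below is a minimal stand-in for `calculate_shannon_entropy` (whose definition appears earlier in the notebook), assuming entropy is measured in bits over the distribution of sentiment labels. The helper name `label_entropy` is hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Sentiment labels copied from the two top-10 tables above (sample query)
baseline_labels = ["BEARISH", "BULLISH", "BULLISH", "BULLISH", "BEARISH",
                   "BULLISH", "BEARISH", "BULLISH", "BULLISH", "BULLISH"]
zigzag_labels = ["BEARISH", "BULLISH", "BULLISH", "BEARISH", "BULLISH",
                 "BEARISH", "BULLISH", "BULLISH", "BULLISH", "BEARISH"]

print(f"Baseline: {label_entropy(baseline_labels):.3f} bits")
print(f"Zig-Zag:  {label_entropy(zigzag_labels):.3f} bits")
```

For this query the re-ranker moves the split from 3 bearish / 7 bullish to 4 bearish / 6 bullish, which raises the entropy from roughly 0.88 to 0.97 bits.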


---

### 3.3 - Query Expansion

In this experiment, we explore LLM-based query expansion using Microsoft Phi-3-mini-4k-instruct to generate semantically meaningful keywords that augment the original queries.

Unlike traditional probabilistic methods like RM3, which extract expansion terms from pseudo-relevant documents, our approach leverages the vast pre-trained knowledge of a language model to identify contextually appropriate financial terminology. This allows the system to suggest relevant concepts and synonyms that may not appear in initially retrieved documents, potentially bridging vocabulary gaps between user queries and expert financial content.

We compare this neural expansion approach against the RM3 baseline to evaluate whether semantic understanding from pre-trained models can outperform statistical term extraction.

Additionally, this query expansion method could complement diversity-focused techniques in our broader research goal of mitigating echo chambers in Financial QA. Improved query expansion could strengthen the initial retrieval stage, providing higher-quality candidate documents for subsequent diversity-focused re-ranking methods to work with.

In [None]:
!pip install pyterrier[java]
!pip install transformers



In [None]:
%%space santagati
import os
import shutil

import pandas as pd
import pyterrier as pt
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [None]:
%%space santagati
# Get the corpus
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_path = './fiqa-index'

# Remove any existing index folder before building again
if os.path.exists(index_path):
    shutil.rmtree(index_path)

indexer = pt.index.IterDictIndexer(
    index_path,
    meta={'docno':20},
    fields=['text']
)

index_ref = indexer.index(dataset.get_corpus_iter())

beir/fiqa/test documents:   0%|          | 0/57638 [00:00<?, ?it/s]

18:16:24.522 [ForkJoinPool-4-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents


In [None]:
%%space santagati
topics = dataset.get_topics()
topics['query'] = topics['query'].str.replace(':', ' ', regex=False)
topics['query'] = topics['query'].str.replace('"', '', regex=False)
topics['query'] = topics['query'].str.replace("'", '', regex=False)
display(topics.head(10))

Unnamed: 0,qid,query
0,4641,Where should I park my rainy-day / emergency f...
1,5503,Tax considerations for selling a property belo...
2,7803,Can the Delta be used to calculate the option ...
3,7017,Basic Algorithmic Trading Strategy
4,10152,What does a high operating margin but a small ...
5,3451,Should you keep your stocks if you are too lat...
6,4804,How do financial services aimed at women diffe...
7,7911,What is the difference between a trader and a ...
8,10809,Definitions of leverage and of leverage factor
9,6715,What does it mean if “IPOs - normally are sold...


In [None]:
%%space santagati
qrels = dataset.get_qrels()
display(qrels.sample(5))

Unnamed: 0,qid,docno,label,iteration
647,3789,571131,1,0
1352,7758,574327,1,0
1177,6525,98150,1,0
147,1150,353369,1,0
1032,5862,562511,1,0


Load the model

In [None]:
%%space santagati
model_id = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

`torch_dtype` is deprecated! Use `dtype` instead!


Device set to use cuda:0


An instruction-tuned model is well suited to this task: a clear prompt is enough to make it output a correctly formatted sequence of keywords that can be appended to the original query.

In [None]:
%%space santagati
def expand_query_with_llm(query: str, max_new_tokens: int = 35) -> str:
    """Generate comma-separated expansion keywords for a query with the LLM."""
    prompt = [
        {"role": "system", "content": "You are a helpful financial assistant that generates relevant keywords."},
        {"role": "user", "content": f"Generate exactly 10 financial keywords related to this query, separated by commas: {query}\nKeywords:"}
    ]

    output = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        return_full_text=False,
        temperature=0.6,
    )

    expanded = output[0]["generated_text"].strip()

    # cleaning
    expanded = expanded.replace("\n", " ").strip()

    return expanded

def apply_expansion_and_rename(df):
    new_queries = []

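    for q in tqdm(df["query"], desc="Expanding queries"):
        try:
            expanded = expand_query_with_llm(q)
        except Exception:
            # Fall back to the unexpanded query if generation fails
            expanded = ""

        # Build additive expansion: original query followed by the generated keywords
        new_query = f"{q} {expanded}".strip()
        new_queries.append(new_query)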

    return pd.DataFrame({
        "qid": df["qid"],
        "query": new_queries
    })

In [None]:
%%space santagati
# Expanded queries
expanded_queries = apply_expansion_and_rename(topics)

display(topics)
display(expanded_queries)

Expanding queries: 100%|██████████| 648/648 [20:25<00:00,  1.89s/it]


Unnamed: 0,qid,query
0,4641,Where should I park my rainy-day / emergency f...
1,5503,Tax considerations for selling a property belo...
2,7803,Can the Delta be used to calculate the option ...
3,7017,Basic Algorithmic Trading Strategy
4,10152,What does a high operating margin but a small ...
...,...,...
643,4102,How can I determine if my rate of return is “g...
644,3566,Where can I buy stocks if I only want to inves...
645,94,Using credit card points to pay for tax deduct...
646,2551,How to find cheaper alternatives to a traditio...


Unnamed: 0,qid,query
0,4641,Where should I park my rainy-day / emergency f...
1,5503,Tax considerations for selling a property belo...
2,7803,Can the Delta be used to calculate the option ...
3,7017,Basic Algorithmic Trading Strategy Keywords: a...
4,10152,What does a high operating margin but a small ...
...,...,...
643,4102,How can I determine if my rate of return is “g...
644,3566,Where can I buy stocks if I only want to inves...
645,94,Using credit card points to pay for tax deduct...
646,2551,How to find cheaper alternatives to a traditio...
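
The fourth row of the expanded table (qid 7017) shows the model echoing the `Keywords:` cue from the prompt into the query. A post-processing step along the following lines could normalize the generated text before it is appended. This is only a sketch: the helper `clean_keywords` and its `max_terms` parameter are hypothetical and were not part of the evaluated run.

```python
import re

def clean_keywords(generated: str, max_terms: int = 10) -> str:
    """Normalize raw LLM output into a space-separated keyword string."""
    text = generated.replace("\n", " ").strip()
    # Drop a leading "Keywords:" cue echoed back from the prompt
    text = re.sub(r"^\s*keywords\s*:\s*", "", text, flags=re.IGNORECASE)
    terms = []
    seen = set()
    for raw in text.split(","):
        # Keep only word characters, spaces and hyphens; this mirrors the
        # ':' and quote stripping already applied to the topics
        term = re.sub(r"[^\w\s-]", " ", raw)
        term = re.sub(r"\s+", " ", term).strip().lower()
        # Skip empty fragments and case-insensitive duplicates
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return " ".join(terms[:max_terms])

raw_output = "Keywords: algorithmic trading, backtesting, Algorithmic Trading, risk management"
print(clean_keywords(raw_output))
```

Applying such a step would change the expanded queries, so the retrieval metrics would need to be recomputed before drawing any conclusion from it.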


In [None]:
%%space santagati
retrieval_pipeline_bm25 = (pt.terrier.Retriever(index_ref, wmodel="BM25"))
retrieval_effectiveness = pt.Experiment(
    [retrieval_pipeline_bm25],
    expanded_queries[['qid','query']],
    qrels,
    names=["BM25+Microsoft-Phi-3-Expansion"],
    eval_metrics=["map", "ndcg_cut_5", "ndcg_cut_10", "P_1", "P_5", "P_10", "recall_5", "recall_10"]
)

display(retrieval_effectiveness)

Unnamed: 0,name,map,P_1,P_5,P_10,recall_5,recall_10,ndcg_cut_5,ndcg_cut_10
0,BM25+Microsoft-Phi-3-Expansion,0.170959,0.175926,0.092593,0.060802,0.216294,0.277327,0.189278,0.2104


LLM-based query expansion is outperformed by RM3 across all metrics. The performance gap is largest in top-rank positions, with RM3 showing 21% better P@1, which is critical for question-answering tasks where users expect immediate accurate results.

RM3's advantage comes from extracting expansion terms from documents actually retrieved from the collection, which ensures that the added keywords match the collection's terminology. In contrast, Phi-3's general knowledge produces less effective terms for this specific financial corpus, likely due to semantic drift.

## 4 - Discussion and conclusion

### 4.1 - Cross encoder
Adding the Cross-Encoder re-ranker consistently outperforms the BM25 baseline because it moves beyond simple keyword matching to capture semantic relationships between query and document. The pipeline combines the efficiency of sparse first-stage retrieval with the precision of neural re-ranking, boosting metrics such as nDCG@10 and MAP.
However, this method introduces a substantial computational bottleneck: the Cross-Encoder must run a full forward pass for every query-document pair, making it much slower than the initial retrieval. There is also a risk of domain mismatch, since the model was trained on the MS MARCO dataset (general web passages), which may not align well with the specialized financial terminology found in the FiQA dataset. Finally, the model has a limited input length, so even though the code stores up to 4096 characters per document, the re-ranker may truncate the end of longer financial documents and lose critical information located later in the text.
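
To get a feel for how much text a length cap can discard, the sketch below estimates the share of a document that falls beyond a fixed token budget. It uses whitespace splitting as a crude stand-in for the model's subword tokenizer (which would typically produce more tokens, and the query also consumes part of the budget). Both the function name `truncated_share` and the 512-token default are assumptions for illustration, not values read from the notebook.

```python
def truncated_share(text: str, max_tokens: int = 512) -> float:
    """Fraction of whitespace tokens that would be cut off by a length cap."""
    tokens = text.split()
    if not tokens:
        return 0.0
    kept = min(len(tokens), max_tokens)
    return 1.0 - kept / len(tokens)

# Synthetic 800-word document, used only to exercise the function
doc = "word " * 800
print(f"{truncated_share(doc):.0%} of the document is dropped")
```

Running a check like this over the corpus with the real tokenizer would show how many FiQA answers are actually affected by truncation.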


### 4.2 - Sentiment diversification

Our experimental results demonstrate a classic trade-off inherent to diversified information retrieval, often observed in Maximal Marginal Relevance (MMR) frameworks.

Key Findings:
* **Relevance Preservation**: The Soft Zig-Zag approach ($\lambda=0.25$) achieved a Mean Average Precision (MAP) of 0.2035, a small decrease of about 1.1% relative to the baseline's 0.2058. This indicates that the system continues to retrieve highly relevant financial information.
* **Diversity Gain**: We observed a 6.26% improvement in Shannon Entropy (from 1.184 to 1.258). This confirms that the re-ranker successfully injected diverse viewpoints into the top results, mitigating the filter bubble effect without "breaking" the retrieval quality.

It is worth noting that the modest entropy increase reflects a refinement of an already pluralistic system. With a high baseline of 1.18 bits, the retriever was already operating at approximately 75% of the theoretical maximum diversity ($H_{max}=\log_2 3 \approx 1.585$ bits). Unlike personalization-driven Recommender Systems (RS), which frequently trigger "Filter Bubbles" through iterative feedback loops, Information Retrieval (IR) is anchored by objective query relevance, making it structurally more robust against extreme sentiment isolation. We therefore read the modest gain as evidence that the framework reduces sentiment homogeneity while preserving the system's primary utility as a reliable financial discovery tool.
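
The figures quoted in this section can be reproduced from the reported results with a few lines of arithmetic. The sketch below assumes three sentiment classes, so that $H_{max} = \log_2 3$, which matches the stated maximum of 1.585 bits. Variable names are illustrative.

```python
import math

# Values copied from the experiment outputs in section 3.2.4
map_baseline, map_zigzag = 0.205818, 0.203477
h_baseline, h_zigzag = 1.1841, 1.2583

h_max = math.log2(3)  # assumes three sentiment classes
map_change = (map_zigzag - map_baseline) / map_baseline * 100
entropy_gain = (h_zigzag - h_baseline) / h_baseline * 100

print(f"H_max:                 {h_max:.3f} bits")
print(f"Baseline / H_max:      {h_baseline / h_max:.1%}")
print(f"Zig-Zag / H_max:       {h_zigzag / h_max:.1%}")
print(f"Relative MAP change:   {map_change:.2f}%")
print(f"Relative entropy gain: {entropy_gain:.2f}%")
```

Because the entropy inputs here are the rounded printed values, the computed gain can differ from the notebook's 6.26% in the last digit. The overall picture is a relative MAP drop of about 1.1% against an entropy gain of about 6.3%.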



### 4.3 - Query expansion

The LLM-based query expansion experiment produced lower retrieval effectiveness than traditional RM3 expansion, likely due to the model lacking specific domain knowledge.

Given these results, we would not recommend integrating LLM-based expansion into a broader echo chamber mitigation pipeline, as it introduces complexity without enhancing the core objective.

A natural follow-up would be to use a model fine-tuned for financial QA, which could generate more domain-specific and semantically meaningful expansion terms.

## 5 - References

### 5.1 - Cross encoder
MS MARCO MiniLM cross-encoder (`cross-encoder/ms-marco-MiniLM-L6-v2`): from [Hugging Face](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2)

### 5.2 - Sentiment diversification
```
@misc{FinTwitBERT-sentiment,
  author = {Stephan Akkerman and Tim Koornstra},
  title = {FinTwitBERT-sentiment: A Sentiment Classifier for Financial Tweets},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/StephanAkkerman/FinTwitBERT-sentiment}}
}
```



### 5.3 - Query expansion
Microsoft Phi-3-mini-4k-instruct: from [Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)