TODO: Change this notebook to use the `llmware/rag_instruct_test_dataset2_financial_0.1` dataset, which provides questions and expected answers for evaluation.

In [None]:
from datasets import load_from_disk, Dataset
from sentence_transformers import SentenceTransformer


# Semantic Retrieval

Semantic retrieval is the process of searching and ranking information based on meaning rather than exact keyword matches. 
Traditional lexical methods such as TF-IDF rely on shared vocabulary: a document scores as relevant only if it contains the same words as the query. 
Semantic retrieval goes further by understanding context, relationships between words, and intent behind a query.

Using modern language models and vector representations, semantic retrieval converts text into embeddings—numerical representations of meaning. 
Instead of matching tokens, it measures similarity in conceptual space, so that a query such as `"electric vehicles"` can match `"automakers invest in EVs"`, and `"digital currency regulations"` can match `"cryptocurrency policy updates"`.
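
To make the vocabulary problem concrete, here is a minimal sketch that measures plain token overlap for the two example pairs above. The helper names `tokens` and `jaccard` are hypothetical and not part of this notebook, and Jaccard overlap is used only as a simple stand-in for keyword matching, not as an implementation of TF-IDF.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Fraction of distinct tokens shared by two strings (0.0 means no overlap)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# The query/document pairs mentioned in the text above.
pairs = [
    ("electric vehicles", "automakers invest in EVs"),
    ("digital currency regulations", "cryptocurrency policy updates"),
]

overlaps = [jaccard(query, doc) for query, doc in pairs]
for (query, doc), overlap in zip(pairs, overlaps):
    print(f"{query!r} vs {doc!r}: token overlap = {overlap:.2f}")
```

Both pairs share no tokens, so a purely lexical scorer has nothing to match on even though the texts are closely related in meaning.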

This shift leads to more accurate search results, greater robustness to synonyms and paraphrasing, and more intuitive answers—especially in domains where wording varies widely, such as finance, law, or customer support.

## Dataset and Search

The dataset used in this task is a subset of a [collection of financial news articles](https://huggingface.co/datasets/Brianferrell787/financial-news-multisource). 
The corresponding text embeddings have already been generated and included in this repository. 
As a result, once an instance of the class below is initialized, it can immediately be used to query the dataset.

The semantic search process works as follows:

1. The search query is embedded using the same model that was used to embed the dataset.
2. The query embedding is compared against all document embeddings to find the top-k most similar items. As with TF-IDF search, cosine similarity is used to measure relevance.
3. The most relevant results are returned to the user.
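
The steps above can be sketched in plain Python, assuming the embeddings are already available as lists of floats. The three-dimensional vectors below are made up for illustration (real sentence embeddings have hundreds of dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers, not the FAISS-backed search used by the class below.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Step 1 stand-in: toy vectors in place of real model embeddings (assumption).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

# Steps 2 and 3: score every document against the query and keep the best two.
results = top_k(query_vec, doc_vecs, k=2)
for index, score in results:
    print(f"document {index}: similarity = {score:.3f}")
```

This brute-force version compares the query against every document, which is fine for small corpora; a vector index such as FAISS serves the same purpose more efficiently at scale.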

In [2]:
class InformationRetrieval:
    """
    Handles the retrieval of news articles from the corpus.

    Attributes:
        ds (Dataset): The news data corpus.
        embedding_model (SentenceTransformer): Embedding model for the text.
    """
    def __init__(self) -> None:
        """
        Initialize a new news data retriever.
        """
        self.ds = load_from_disk("data/financial-news-dataset")

        self.embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', token=False)
        self.load_index()

    def load_index(self, index_file: str = "data/my_index.faiss") -> None:
        """
        Load the prebuilt FAISS index for the 'embeddings' column from a local file.

        Args:
            index_file (str): Path for the index file.
        """
        self.ds.load_faiss_index('embeddings', index_file)

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """
        Search the corpus for the articles most similar to the query.

        Args:
            query (str): Search query.
            top_k (int): Number of articles returned.

        Returns:
            list[str]: List of most similar articles.
        """
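        # Embed the query with the same model that was used for the corpus.
        query_embedding = self.embedding_model.encode(query)

        # Look up the top_k nearest articles in the FAISS index; scores are not returned.
        _scores, retrieved_examples = self.ds.get_nearest_examples('embeddings', query_embedding, k=top_k)
        return retrieved_examples["text"]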

In [4]:
retriever = InformationRetrieval()
retriever.search("What news are there on Ted Cruz?")

['Carly Fiorina: Ted Cruz says \'whatever\' to get elected\n\nWashington (CNN)Carly Fiorina, herself recently accused of pandering on Twitter, chided Ted Cruz on Sunday, saying he "says whatever he needs to say to get elected." Fiorina, the former Hewlett-Packard executive and Cruz\'s rival for the GOP nomination, hit Cruz in an interview with CNN\'s Dana Bash on "State of the Union." She kept up her criticism of the Texas senator for his 2013 push for a government shutdown in an ill-fated attempt to repeal President Barack Obama\'s signature health care law. "Ted Cruz is just like any other politician. He says one thing in Manhattan, he says another thing in Iowa," Fiorina said Sunday. Recordings of Cruz speaking about gay marriage to donors in New York City appear to differ in style, but not substance, from his speeches to conservative supporters. "He says whatever he needs to say to get elected, and then he\'s going to do as he pleases," she said. "I think people are tired of a poli

### References

Ferrell, B. (2025). *financial-news-multisource* (Revision b509ef6) [Data set]. Hugging Face.  
DOI: 10.57967/hf/6432  
Available: https://huggingface.co/datasets/Brianferrell787/financial-news-multisource