# Leveraging Sparse Embeddings for Financial Reddit Analysis

In this example, we'll build an agent that can search through Reddit posts about stocks and financial markets
using sparse embeddings. Unlike dense embeddings that capture overall semantic meaning, sparse embeddings excel at
capturing specific keywords and rare terms - perfect for financial contexts where stock tickers (like AAPL, TSLA)
and company names need to be precisely matched.

## Why Sparse Embeddings for Financial Data?

Financial discussions on platforms like Reddit often contain:
1. Stock tickers (PLTR, NVDA) that need exact matching
2. Company names that might be misspelled (Palanteer vs. Palantir)
3. Industry-specific jargon that dense embeddings might miss

Sparse embeddings shine here because they:
- Maintain high precision for exact term matches (crucial for tickers)
- Preserve term frequency information (how often a stock is mentioned)
- Allow for better retrieval of documents with rare but important terms

Let's see how we can implement this approach with Pinecone's sparse vector database.

First, we install the prerequisites:

In [1]:
!pip install -qU \
  langchain==0.3.25 \
  langchain-community==0.3.24 \
  langchain-pinecone==0.2.6 \
  langchain-openai==0.3.18 \
  datasets==3.6.0

[0m

## Data Prep

We will use a dataset of posts from various finances subreddits, like `r/stocks`, `r/finance`, `r/pennystocks`, and `r/wallstreetbets`.

In [1]:
from datasets import load_dataset

posts = load_dataset("aurelio-ai/reddit-finance", split="train")
posts

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['id', 'subreddit', 'title', 'selftext'],
    num_rows: 107
})

In [2]:
posts[0]

{'id': '1j0w73o',
 'subreddit': 'stocks',
 'title': 'Rate My Portfolio - r/Stocks Quarterly Thread March 2025',
 'selftext': "Please use this thread to discuss your portfolio, learn of other stock tickers &amp; portfolios like [Warren Buffet's](https://buffett.online/en/portfolio/), and help out users by giving constructive criticism.\n\nWhy quarterly?  Public companies report earnings quarterly; many investors take this as an opportunity to rebalance their portfolios.  We highly recommend you do some reading:  Check out our wiki's list of [relevant posts &amp; book recommendations.](https://www.reddit.com/r/stocks/wiki/index/#wiki_relevant_posts.2C_books.2C_wiki_recommendations)\n\nYou can find stocks on your own by using a scanner like your broker's or [Finviz.](https://finviz.com/screener.ashx)  To help further, here's a list of [relevant websites.](https://www.reddit.com/r/stocks/wiki/index/#wiki_relevant_websites.2Fapps)\n\nIf you don't have a broker yet, see our [list of brokers]

We add the post links like so:

In [3]:
def add_permalink(x):
    x["permalink"] = f"https://reddit.com/r/{x['subreddit']}/comments/{x['id']}"
    return x

posts = posts.map(add_permalink)

## Sparse Vector Index Setup

For our Reddit stocks analysis, we need a specialized vector index that supports sparse vectors.
Unlike dense vectors (which might be 1024 dimensions with most values being non-zero),
sparse vectors can have millions of dimensions but only a few non-zero values.

This sparse representation is perfect for capturing the presence of specific terms like "NVDA" or "earnings call"
without diluting their importance in a dense representation.

## Sparse Embeddings Model

We'll use Pinecone's specialized sparse embedding model that's designed to:
1. Identify important financial terms and stock tickers
2. Create sparse vector representations that emphasize these key terms
3. Enable high-precision matching for financial queries

In [4]:
import os
from getpass import getpass
from pinecone import Pinecone

os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") \
    or getpass("Enter your Pinecone API key: ")

pc = Pinecone()

Enter your Pinecone API key: ··········


In [5]:
from pinecone import ServerlessSpec, CloudProvider, AwsRegion, Metric

index_name = "rag-sparse-query-expansion"

if not pc.has_index(name=index_name):
    pc.create_index(
        name=index_name,
        metric=Metric.DOTPRODUCT,
        dimension=None,
        spec=ServerlessSpec(
            cloud=CloudProvider.AWS,
            region=AwsRegion.US_EAST_1
        ),
        vector_type="sparse",
    )

index = pc.Index(name=index_name)

Now we initialize our sparse embedding model:

In [6]:
from langchain_pinecone.embeddings import PineconeSparseEmbeddings

sparse_embeddings = PineconeSparseEmbeddings(
    model="pinecone-sparse-english-v0", show_progress_bar=True
)

We bring these together to initialize our LangChain `PineconeSparseVectorStore` object.

In [7]:
from langchain_pinecone import PineconeSparseVectorStore

vector_store = PineconeSparseVectorStore(
    index=index, embedding=sparse_embeddings
)

Now we index everything. We will be embedding both the post title and text content, this helps us capture greater context than just embedding one or the other.

In [9]:
from tqdm.auto import tqdm

# we add records in batches of 100 - for large datasets this is essential to
# avoid excessively large requests or out-of-memory errors
batch_size = 100

for i in tqdm(range(0, len(posts), batch_size)):
    # grab batch
    end_idx = min(i + batch_size, len(posts))
    batch = posts.select(range(i, end_idx))
    # merge title and content before embedding
    title_and_content = [
        f"## {post['title']}\n\n{post['selftext']}" for post in batch
    ]
    # prep our metadata
    metadata_batch = [
        {
            "subreddit": post["subreddit"],
            "permalink": post["permalink"]
        } for post in batch
    ]
    # add our data to the vec store
    vector_store.add_texts(
        texts=title_and_content,
        metadatas=metadata_batch[i:end_idx]
    )

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

## Query Expansion: Bridging ticker and company names for financial search

One of the biggest challenges in financial search is the vocabulary mismatch problem:
- Users might search for "Palanteer" when they mean "Palantir" (PLTR)
- They might use "NVDA stock drop" when relevant posts mention "Nvidia shares plummeting"
- They might search for a company name when Reddit posts use only the ticker symbol

To solve this, we'll implement query expansion - a technique that transforms a single query
into multiple alternative queries that capture the same intent but with different terminology.

We'll be using OpenAI for this part, for which we need an [API key](https://platform.openai.com) which we enter below:

In [10]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") \
    or getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


Now we define our LLM that will be used _within_ our query expansion tool to create our expanded queries. We can use `with_structured_output` to force a particular output format.

In [11]:
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

# the pydantic BaseModel defines the schema we want for our LLM output
class QueryArray(BaseModel):
    queries: list[str]

# LLM for query expansion with high temp for increased "creativity"
# and structured output
expansion_llm = ChatOpenAI(
    name="gpt-4.1-mini", temperature=0.7
).with_structured_output(QueryArray)



Next we define the prompt that explains to our `expansion_llm` what it's purpose is and how it should tackle it's query expansion task.

In [12]:
expand_query_prompt = (
    "You are an expert at expanding search queries to improve search "
    "results. Given a query about stocks, companies, or financial "
    "markets, generate {num_expansions} alternative queries that "
    "capture the same intent but use different wording, synonyms, or "
    "related concepts.\n"
    "For company names, include both the company name and ticker "
    "symbol when possible. For misspelled company names, include the "
    "correct spelling in the alternatives. For general topics, i.e. "
    '"tech company performance", expand these queries to specific '
    'companies you know of, like "Google", "Apple", "Nvidia".\n'
    'For example, a query about Google could expand to: ["GOOGL", '
    '"Alphabet"].\n'
    "Another example, a query about FAANG company performance could "
    'expand to: ["META performance", "GOOGL performance", "AAPL '
    'performance"] and so on.\n"'
    "Original query: {query}"
)

We can run the LLM directly like so:

In [13]:
# get response from LLM
response = expansion_llm.invoke(expand_query_prompt.format(
    query="nvidia stonks",
    num_expansions=5
))
response

QueryArray(queries=['NVIDIA stocks', 'NVDA stonks', 'NVIDIA stock performance', 'NVIDIA ticker symbol', 'NVIDIA financial performance'])

However, we'd like this query expansion functionality to be an optional _tool_ that our main agent can call. For this, we define the `expand_query` tool. The main agent will decide whether to use this tool (or not) based on the instructions we provide in the function docstring below:

In [37]:
from langchain_core.tools import tool

@tool
def expand_query(query: str, num_expansions: int = 5) -> list[str]:
    """
    Expand a search query into multiple alternative queries to improve search
    results. Ideally you must capture the broader scope of queries beyond the
    provided query.
    This is useful when:
    - The original query is vague or ambiguous
    - You need to find synonyms or related terms
    - You want to cover broader aspects of the topic
    - The query contains misspelled company names or stock tickers

    For example, a query like "Tesla stock drop" could be expanded to:
    - "TSLA share price decrease"
    - "Tesla Motors stock market decline"
    - "EV companies market decline"
    - "BYD stock performance"
    - "Elon Musk company stock performance"
    To get a broader yet relevant scope of results. Diversity of results are
    important.

    Args:
        query: The original query that needs to be expanded
        num_expansions: Number of alternative queries to generate

    Returns:
        A list of expanded queries including the original query
    """
    try:
        # get response from LLM
        response = expansion_llm.invoke(expand_query_prompt.format(
            query=query,
            num_expansions=num_expansions
        ))
        # return the expanded queries
        return response.queries
    except Exception as e:
        return [f"Error expanding query: {str(e)}"]

## Searching Reddit with Sparse Vectors

Now that we have our query expansion mechanism, let's create a tool that leverages both
sparse embeddings and query expansion to search through Reddit posts about stocks.

The magic happens in the combination:
1. Query expansion generates multiple versions of the search query
2. Each expanded query is converted to a sparse vector representation
3. Pinecone's sparse similarity search finds the most relevant posts
4. Results are deduplicated and ranked by relevance

In [57]:
from typing import Any
from langchain_core.documents import Document

@tool
def reddit_stock_mentions(
    queries: list[str], max_results: int = 10, **kwargs
) -> list[dict[str, Any]]:
    """
    Search through Reddit posts from finance subreddits like r/wallstreetbets,
    r/investing, and r/pennystocks to find mentions of stocks, companies, or
    market trends.

    When using this tool:
    - For company searches, include both the company name and ticker symbol if
      known (e.g., "Tesla TSLA")
    - For misspelled company names, provide the correct spelling
    - Consider alternative terms that might be used on Reddit (e.g., "$PLTR"
      instead of just "Palantir")
    - Be specific about the type of information you're looking for (e.g.,
      "NVIDIA earnings reaction")

    Args:
        query: search query to find relevant posts about stocks or companies
        max_results: how many posts to return

    Returns:
        A list of relevant Reddit posts with metadata
    """
    try:
        seen_docs = set()  # To track unique documents
        docs: list[tuple[Document, float]] = []

        # Perform the search using sparse vectors
        # This is where the magic happens - sparse vectors excel at matching
        # specific terms like stock tickers and company names
        for query in queries:
            docs.extend(
                vector_store.similarity_search_with_score(
                    query=query, k=max_results
                )
            )

        unique_docs: list[dict] = []

        # Add unique results
        for doc, score in docs:
            # Create a unique identifier for the document
            doc_id = f"{doc.metadata.get('permalink', '')}"

            # Only add if we haven't seen this document before
            if doc_id not in seen_docs:
                seen_docs.add(doc_id)
                unique_docs.append({
                    "content": doc.page_content,
                    "subreddit": doc.metadata.get("subreddit", "unknown"),
                    "score": score,
                    "url": doc.metadata.get("permalink", ""),
                    "matched_query": query,
                })

        # Sort by score (if available) and limit to max_results
        unique_docs.sort(key=lambda x: x.get("score", 0), reverse=True)
        return unique_docs[:max_results]

    except Exception as e:
        return [{"error": f"Error searching Reddit mentions: {str(e)}"}]


## Putting It All Together: The Reddit Stocks Agent

Now we'll create an agent that orchestrates our tools to help users find and analyze
what Reddit is saying about stocks and companies. The agent will:

1. Use query expansion to handle misspellings and terminology variations
2. Leverage sparse embeddings for precise matching of financial terms
3. Analyze and summarize the Reddit sentiment about stocks

This approach combines the best of both worlds:
- LLMs for understanding user intent and generating alternative queries
- Sparse vector search for high-precision retrieval of financial discussions

We start with setting up our system prompt and chat prompt template, which will include placeholders for chat history _and_ intermediate steps where the agent will be able to store information returned from it's various tools before responding to the user.

In [58]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a system prompt that emphasizes using query expansion
system_prompt = (
    "You are a financial research assistant that helps users find information "
    "about stocks and companies from Reddit discussions.\n\n"
    "ALWAYS use the expand_query tool first when searching for information. "
    "This is critical because:\n"
    "1. It helps find alt terms and synonyms that Reddit users might use\n"
    "2. It corrects misspelled company names or stock tickers\n"
    "3. It improves search coverage by generating multiple related queries\n\n"
    "After expanding the query, use the reddit_stock_mentions tool with the "
    "expanded queries to find relevant posts.\n"
    "Remember to analyze the results and provide a comprehensive summary of "
    "what Reddit users are saying about the stock or company in question."
)

# Create the prompt template with message history
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

Now we setup chat memory, we'll use a simple conversation buffer (ie save everything).

In [59]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

Now bring all of this together to create our agent:

In [60]:
from langchain_core.runnables.base import RunnableSerializable
from langchain.agents import create_openai_tools_agent, AgentExecutor

llm = ChatOpenAI()
tools = [expand_query, reddit_stock_mentions]

agent = create_openai_tools_agent(
    llm=llm, tools=tools, prompt=prompt
)

# package into an agent executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
)

## Testing Our Agent with Real-World Queries

Let's put our agent to the test with some realistic queries that demonstrate its capabilities:
1. Handling misspelled company names (Palanteer → Palantir)
2. Finding discussions about specific events (NVIDIA earnings)
3. Exploring broader topics (biotech companies)

The agent will:
- Expand each query to cover different terminology
- Search using sparse vectors for precise matching
- Analyze and summarize the Reddit sentiment

In [61]:
from IPython.display import Markdown, display

query = "What happened to semiconductor companies' stock recently?"

# run the agent with the query
result = agent_executor.invoke({"input": query})

display(Markdown(result['output']))



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `expand_query` with `{'query': 'semiconductor companies stock recent news'}`


[0m[36;1m[1;3m['semiconductor companies stock recent news', 'semiconductor industry stock updates', 'latest news on semiconductor companies', 'semiconductor sector stock news', 'chipmakers stock recent developments'][0m[32;1m[1;3m
Invoking: `reddit_stock_mentions` with `{'queries': ['semiconductor companies stock recent news', 'semiconductor industry stock updates', 'latest news on semiconductor companies', 'semiconductor sector stock news', 'chipmakers stock recent developments'], 'max_results': 10}`


[0m[33;1m[1;3m[{'content': "## RHCO big news out today.. Finalized ACQUISITION of Morrich Lottery Limited and secured LOTTERY, SPORTSBOOK, and CASINO LICENSES\n\nThis stock has very little volume and its a small float. But looks good with their current financials and this recent completed acquisition to add more to the books.\n\n

Reddit users have discussed various topics related to recent developments in the semiconductor industry, as follows:

1. **RHCO Acquisition News:** There was a mention of RHCO finalizing the acquisition of Morrich Lottery Limited and securing lottery, sportsbook, and casino licenses. The stock has low volume and a small float, but its financials seem promising, with $5.7 million in revenue last quarter and other positive aspects.

2. **SS Innovations Uplisting to Nasdaq:** SS Innovations, a developer of surgical robotic technologies, announced its uplisting to the Nasdaq Capital Market. The company's recent achievements in the robotic surgery field were highlighted, including revenue growth and global expansion plans.

3. **Google's Earnings and Cloud Services:** Reddit users discussed Alphabet's (Google's parent company) upcoming earnings report and the focus on the company's cloud services. There was anticipation of strong revenue growth, cost management challenges, and the impact of tariffs and geopolitical uncertainty on Alphabet's operations.

4. **Airlines and Tourism Impact:** Posts also talked about the impact of a rapid drop in domestic demand on airlines and related industries like real estate, banks, and healthcare. The discussion highlighted how the travel crisis and reduced tourism could affect various companies and sectors.

5. **M&A Speculation on LXRX:** There was a discussion on why Reddit users believed that LXRX might be in negotiations for a buyout. Various indicators, such as changes in leadership, strategic moves, and licensing agreements, pointed towards a potential acquisition.

6. **Gold Stocks Performance:** Users discussed the recent increase in gold prices, reaching $3,300 per ounce, and how this has led to a surge in interest in gold mining companies. Different types of gold stocks, including major producers, junior miners, and streaming companies, were analyzed for their performance and potential.

Overall, Reddit discussions covered a range of topics including company acquisitions, stock uplistings, earnings reports, industry impacts, M&A speculation, and gold stocks performance in relation to recent developments in the semiconductor sector.

## How does our search perform without query expansion?

Let's see what we'd be missing out on if we weren't using query expansion in our search.

We'll set up another agent that doesn't have access to the query expansion tool.

In [62]:
memory2 = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# this time initialize with ONLY the reddit_stock_mentions
agent_wo_expansion = create_openai_tools_agent(
    llm=llm, tools=[reddit_stock_mentions], prompt=prompt
)

# package into another agent executor
agent_executor = AgentExecutor(
    agent=agent_wo_expansion,
    tools=tools,
    memory=memory2,
    verbose=True,
)

Now let's try the same query:

In [63]:
result = agent_executor.invoke({"input": query})

display(Markdown(result['output']))



[1m> Entering new AgentExecutor chain...[0m
[32;1m[1;3m
Invoking: `reddit_stock_mentions` with `{'queries': ['semiconductor companies stock recent', 'semiconductor stocks recent', 'semiconductor market update'], 'max_results': 10}`


[0m[33;1m[1;3m[{'content': "## RHCO big news out today.. Finalized ACQUISITION of Morrich Lottery Limited and secured LOTTERY, SPORTSBOOK, and CASINO LICENSES\n\nThis stock has very little volume and its a small float. But looks good with their current financials and this recent completed acquisition to add more to the books.\n\nHere's a quick look at their financials from last quarter:\n\n•$5.7M in revenue last quarter\n\n•Net profitable\n\n•$12.9M in assets\n\n•No convertible notes\n\n•Small float, 24M shares held at DTC (The held at DTC number is the public tradable float)\n\n•Completed acquisition news today, (Finalized ACQUISITION of Morrich Lottery Limited and secured lottery, sportsbook and casino licenses)", 'subreddit': 'unknown', 'score'

Recent Reddit discussions regarding semiconductor companies' stock are a bit varied and not directly focused on specific updates or events related to the semiconductor industry. Here is a brief summary of the discussions found:

1. **RHCO Acquisition News**: There was a discussion about RHCO (a small-cap stock) finalizing the acquisition of Morrich Lottery Limited and securing lottery, sportsbook, and casino licenses. This news highlighted the potential growth and positive financials of the company.

2. **Gold Stocks Performance**: A post discussed the surge in gold prices and the shift in focus towards gold mining companies due to global tensions and economic uncertainties. It outlined different types of gold stocks and their recent performances.

3. **Electric Vehicle Company**: A mention of Damon (DMN), an electric vehicle company, with significant pre-orders for their two-wheel electric vehicles. The post highlighted the company's focus on safety, accessibility tech, and data intelligence in the electric vehicle market.

4. **Investing in Materials Mining Companies**: A newbie investor's interest in warrant stocks of materials mining companies, specifically rare earths. The post inquired about the conversion ratio to common stock and future conversions.

5. **Potential Buyout of LXRX**: A discussion speculating on the potential buyout of LXRX based on various indicators such as management changes, licensing agreements, and golden parachute adoption.

It seems that the discussions on semiconductor companies specifically were not prominent in the recent Reddit posts analyzed. If you are looking for more specific updates on semiconductor stocks, you may need to explore further or provide additional queries for a more targeted search.

Once we're done, we can delete the index to save resources:

In [33]:
index.delete(delete_all=True)

{}

---