[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-sparse-query-expansion.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/langchain/rag-sparse-query-expansion.ipynb)

# Leveraging Sparse Embeddings for Financial Reddit Analysis

In this example, we'll build an agent that can search through Reddit posts about stocks and financial markets
using sparse embeddings. Unlike dense embeddings that capture overall semantic meaning, sparse embeddings excel at
capturing specific keywords and rare terms - perfect for financial contexts where stock tickers (like AAPL, TSLA)
and company names need to be precisely matched.

## Why Sparse Embeddings for Financial Data?

Financial discussions on platforms like Reddit often contain:
1. Stock tickers (PLTR, NVDA) that need exact matching
2. Company names that might be misspelled (Palanteer vs. Palantir)
3. Industry-specific jargon that dense embeddings might miss

Sparse embeddings shine here because they:
- Maintain high precision for exact term matches (crucial for tickers)
- Preserve term frequency information (how often a stock is mentioned)
- Allow for better retrieval of documents with rare but important terms
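
To make the point about rare terms concrete, here is a minimal sketch of inverse document frequency (IDF) weighting, a classic way lexical methods reward rare terms. The three-post corpus and the `idf_weights` helper are invented for illustration; this is not how Pinecone's sparse model computes its weights.

```python
import math
from collections import Counter

# Toy corpus standing in for Reddit posts (invented for illustration).
docs = [
    "pltr earnings beat and the stock is up",
    "the market is down and the stock is flat",
    "is the stock market open today",
]


def idf_weights(corpus: list[str]) -> dict[str, float]:
    """Inverse document frequency: terms found in fewer posts get larger weights."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        # count each term once per document
        doc_freq.update(set(text.split()))
    return {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}


weights = idf_weights(docs)
print(f"pltr:  {weights['pltr']:.2f}")   # appears in one post
print(f"stock: {weights['stock']:.2f}")  # appears in every post
```

The ticker appears in a single post and so receives a larger weight than a word that appears everywhere, which is the behaviour the list above describes.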

Let's see how we can implement this approach with Pinecone's sparse vector database.

### Before you begin

You'll need an [OpenAI API key](https://platform.openai.com/account/api-keys) and a [Pinecone API key](https://app.pinecone.io).

First, we install the prerequisites:

In [1]:
!pip install -qU \
  langchain==0.3.25 \
  langchain-community==0.3.24 \
  langchain-pinecone==0.2.6 \
  langchain-openai==0.3.18 \
  datasets==3.6.0

## Data Prep

We will use a dataset of posts from various finance subreddits, like `r/stocks`, `r/finance`, `r/pennystocks`, and `r/wallstreetbets`.

In [2]:
from datasets import load_dataset

posts = load_dataset("aurelio-ai/reddit-finance", split="train")
posts


Dataset({
    features: ['id', 'subreddit', 'title', 'selftext'],
    num_rows: 107
})

In [3]:
posts[0]

{'id': '1j0w73o',
 'subreddit': 'stocks',
 'title': 'Rate My Portfolio - r/Stocks Quarterly Thread March 2025',
 'selftext': "Please use this thread to discuss your portfolio, learn of other stock tickers &amp; portfolios like [Warren Buffet's](https://buffett.online/en/portfolio/), and help out users by giving constructive criticism.\n\nWhy quarterly?  Public companies report earnings quarterly; many investors take this as an opportunity to rebalance their portfolios.  We highly recommend you do some reading:  Check out our wiki's list of [relevant posts &amp; book recommendations.](https://www.reddit.com/r/stocks/wiki/index/#wiki_relevant_posts.2C_books.2C_wiki_recommendations)\n\nYou can find stocks on your own by using a scanner like your broker's or [Finviz.](https://finviz.com/screener.ashx)  To help further, here's a list of [relevant websites.](https://www.reddit.com/r/stocks/wiki/index/#wiki_relevant_websites.2Fapps)\n\nIf you don't have a broker yet, see our [list of brokers]

We add the post links like so:

In [4]:
def add_permalink(x):
    x["permalink"] = f"reddit.com/r/{x['subreddit']}/comments/{x['id']}"
    return x

posts = posts.map(add_permalink)


## Sparse Vector Index Setup

For our Reddit stocks analysis, we need a specialized vector index that supports sparse vectors.
Unlike dense vectors (which might be 1024 dimensions with most values being non-zero),
sparse vectors can have millions of dimensions but only a few non-zero values.

This sparse representation is perfect for capturing the presence of specific terms like "NVDA" or "earnings call"
without diluting their importance in a dense representation.
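
As a rough illustration of this representation, the sketch below stores a vector as index/value pairs and scores a query by dot product over the shared indices. The hashing scheme, the `DIMENSIONS` constant, and the helper names are assumptions made for the example; a real sparse embedding model learns its own term weights.

```python
import zlib
from collections import Counter

DIMENSIONS = 2**20  # hypothetical size of the sparse space (about one million)


def to_sparse(text: str) -> dict[int, float]:
    """Map each token to a dimension index; the value is the token's count."""
    vec: dict[int, float] = {}
    for token, count in Counter(text.lower().split()).items():
        idx = zlib.crc32(token.encode("utf-8")) % DIMENSIONS
        # accumulate in case two tokens hash to the same index
        vec[idx] = vec.get(idx, 0.0) + count
    return vec


def dot(a: dict[int, float], b: dict[int, float]) -> float:
    """Dot product that only touches indices present in both vectors."""
    return sum(value * b[idx] for idx, value in a.items() if idx in b)


doc_a = to_sparse("NVDA earnings call tonight NVDA guidance")
doc_b = to_sparse("thoughts on index funds")
query = to_sparse("NVDA earnings")

print(len(doc_a), "non-zero values out of", DIMENSIONS, "dimensions")
print("score vs doc_a:", dot(query, doc_a))
print("score vs doc_b:", dot(query, doc_b))
```

Only the handful of dimensions that correspond to terms in the text carry a value, so a post that repeats "NVDA" scores well against an "NVDA" query while an unrelated post scores near zero.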

## Sparse Embeddings Model

We'll use Pinecone's specialized sparse embedding model that's designed to:
1. Identify important financial terms and stock tickers
2. Create sparse vector representations that emphasize these key terms
3. Enable high-precision matching for financial queries

In [5]:
import os
from getpass import getpass
from pinecone import Pinecone

os.environ["PINECONE_API_KEY"] = os.getenv("PINECONE_API_KEY") \
    or getpass("Enter your Pinecone API key: ")

pc = Pinecone()

Enter your Pinecone API key: ··········


In [6]:
from pinecone import ServerlessSpec, CloudProvider, AwsRegion, Metric

index_name = "rag-sparse-query-expansion"

if not pc.has_index(name=index_name):
    pc.create_index(
        name=index_name,
        metric=Metric.DOTPRODUCT,
        dimension=None,
        spec=ServerlessSpec(
            cloud=CloudProvider.AWS,
            region=AwsRegion.US_EAST_1
        ),
        vector_type="sparse",
    )

index = pc.Index(name=index_name)

Now we initialize our sparse embedding model:

In [7]:
from langchain_pinecone.embeddings import PineconeSparseEmbeddings

sparse_embeddings = PineconeSparseEmbeddings(
    model="pinecone-sparse-english-v0", show_progress_bar=True
)

We bring these together to initialize our LangChain `PineconeSparseVectorStore` object.

In [8]:
from langchain_pinecone import PineconeSparseVectorStore

vector_store = PineconeSparseVectorStore(
    index=index, embedding=sparse_embeddings
)

Now we index everything. We embed both the post title and the text content, which captures more context than embedding either one alone.

In [9]:
from tqdm.auto import tqdm

# we add records in batches of 100 - for large datasets this is essential to
# avoid excessively large requests or out-of-memory errors
batch_size = 100

for i in tqdm(range(0, len(posts), batch_size)):
    # grab batch
    end_idx = min(i + batch_size, len(posts))
    batch = posts.select(range(i, end_idx))
    # merge title and content before embedding
    title_and_content = [
        f"## {post['title']}\n\n{post['selftext']}" for post in batch
    ]
    # prep our metadata
    metadata_batch = [
        {
            "subreddit": post["subreddit"],
            "permalink": post["permalink"]
        } for post in batch
    ]
    # add our data to the vec store
    vector_store.add_texts(
        texts=title_and_content,
        # metadata_batch is already aligned with this batch, so pass it whole
        metadatas=metadata_batch
    )
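
The batching pattern can be sketched on its own with plain lists. The `iter_batches` helper and the `post-N` items are hypothetical stand-ins for the dataset; the point is that anything built from a batch is already batch-sized and lines up with it one-to-one.

```python
def iter_batches(n_items: int, batch_size: int):
    """Yield (start, end) index pairs that together cover range(n_items)."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


# 107 items in batches of 100, mirroring the dataset size above
items = [f"post-{i}" for i in range(107)]
batches = list(iter_batches(n_items=len(items), batch_size=100))
print(batches)

for start, end in batches:
    batch = items[start:end]
    # per-batch lists are indexed from 0, not from the global offset
    metadata = [{"source": item} for item in batch]
    print(start, end, len(batch), len(metadata))
```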


## Query Expansion: Bridging ticker and company names for financial search

One of the biggest challenges in financial search is the vocabulary mismatch problem:
- Users might search for "Palanteer" when they mean "Palantir" (PLTR)
- They might use "NVDA stock drop" when relevant posts mention "Nvidia shares plummeting"
- They might search for a company name when Reddit posts use only the ticker symbol

To solve this, we'll implement query expansion - a technique that transforms a single query
into multiple alternative queries that capture the same intent but with different terminology.
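
Before bringing in an LLM, the idea can be sketched with a hand-written alias table. Everything here (`ALIASES`, `expand_with_aliases`) is hypothetical and only covers exact name matches; the LLM-based approach used in this notebook is what handles misspellings and looser paraphrases.

```python
# Hypothetical alias table; in the notebook an LLM generates the alternatives.
ALIASES = {
    "palantir": ["PLTR", "Palantir Technologies"],
    "nvidia": ["NVDA", "Nvidia Corporation"],
    "tesla": ["TSLA", "Tesla Motors"],
}


def expand_with_aliases(query: str) -> list[str]:
    """Return the original query plus one variant per known alias."""
    expanded = [query]
    lowered = query.lower()
    for name, aliases in ALIASES.items():
        if name not in lowered:
            continue
        start = lowered.index(name)
        for alias in aliases:
            # swap the company name for the alias, keeping the rest of the query
            expanded.append(query[:start] + alias + query[start + len(name):])
    return expanded


print(expand_with_aliases("Nvidia stock drop"))
```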

We'll use OpenAI for this part, which requires an [API key](https://platform.openai.com). We enter it below:

In [10]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") \
    or getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········


Now we define our LLM that will be used _within_ our query expansion tool to create our expanded queries. We can use `with_structured_output` to force a particular output format.

In [11]:
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

# the pydantic BaseModel defines the schema we want for our LLM output
class QueryArray(BaseModel):
    queries: list[str]

# LLM for query expansion with high temp for increased "creativity"
# and structured output
expansion_llm = ChatOpenAI(
    model="gpt-4.1-mini", temperature=0.7
).with_structured_output(QueryArray)

Next we define the prompt that explains to our `expansion_llm` what its purpose is and how it should tackle its query expansion task.

In [14]:
expand_query_prompt = (
    "You are an expert at expanding search queries to improve search "
    "results. Given a query about stocks, companies, or financial "
    "markets, generate {num_expansions} alternative queries that "
    "capture the same intent but use different wording, synonyms, or "
    "related concepts.\n"
    "For company names, include both the company name and ticker "
    "symbol when possible. For misspelled company names, include the "
    "correct spelling in the alternatives. For general topics, i.e. "
    '"tech company performance", expand these queries to specific '
    'companies you know of, like "Google", "Apple", "Nvidia".\n'
    "Here are some examples:\n"
    '"FAANG company performance" should expand to: '
    '["META performance", "GOOGL performance", "AAPL performance"]\n'
    '"Tesla stock drop" should expand to: '
    '["TSLA share price decrease", "Tesla Motors stock market decline",'
    ' "EV companies market decline", "BYD stock performance",'
    ' "Elon Musk company stock performance"]\n'
    "The goal is to get a broader yet relevant scope of results.\n"
    "Original query: {query}"
)
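
The prompt is an ordinary Python string that gets filled in with `str.format`. The short template below is a hypothetical stand-in for the one above; it shows how the two placeholders are substituted, and that literal curly braces (for instance in a JSON example) would need to be doubled, whereas the square brackets used in the examples above need no escaping.

```python
# Hypothetical, shortened template for illustration only.
template = (
    "Generate {num_expansions} alternative queries.\n"
    'Reply as JSON like {{"queries": ["..."]}}.\n'
    "Original query: {query}"
)

# {{ and }} become literal braces; named fields are substituted
filled = template.format(num_expansions=3, query="nvidia stonks")
print(filled)
```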

We can run the LLM directly like so:

In [15]:
# get response from LLM
response = expansion_llm.invoke(expand_query_prompt.format(
    query="nvidia stonks",
    num_expansions=5
))
response

QueryArray(queries=['NVIDIA stock price', 'NVDA share performance', 'NVIDIA Corporation stock market', 'Graphics card company stock NVDA', 'Semiconductor stocks including NVIDIA'])

However, we'd like this query expansion functionality to be an optional _tool_ that our main agent can call. For this, we define the `expand_query` tool. The main agent will decide whether to use this tool (or not) based on the instructions we provide in the function docstring below:

In [16]:
from langchain_core.tools import tool

@tool
def expand_query(query: str, num_expansions: int = 5) -> list[str]:
    """
    Expand a search query into multiple alternative queries to improve search
    results. Ideally you must capture the broader scope of queries beyond the
    provided query.
    This is useful when:
    - The original query is vague or ambiguous
    - You need to find synonyms or related terms
    - You want to cover broader aspects of the topic
    - The query contains misspelled company names or stock tickers

    Args:
        query: The original query that needs to be expanded
        num_expansions: Number of alternative queries to generate

    Returns:
        A list of expanded queries (the original query is not included)
    """
    try:
        # get response from LLM
        response = expansion_llm.invoke(expand_query_prompt.format(
            query=query,
            num_expansions=num_expansions
        ))
        # return the expanded queries
        return response.queries
    except Exception as e:
        return [f"Error expanding query: {str(e)}"]

## Searching Reddit with Sparse Vectors

Now that we have our query expansion mechanism, let's create a tool that leverages both
sparse embeddings and query expansion to search through Reddit posts about stocks.

The magic happens in the combination:
1. Query expansion generates multiple versions of the search query
2. Each expanded query is converted to a sparse vector representation
3. Pinecone's sparse similarity search finds the most relevant posts
4. Results are deduplicated and ranked by relevance
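
Steps 3 and 4 can be sketched without any external service. The result lists below are invented, and `merge_results` is a hypothetical helper; it assumes scores from different queries are comparable, which is a simplification.

```python
# Hypothetical per-query results as (permalink, score) pairs.
results_by_query = {
    "NVDA share price": [
        ("reddit.com/r/stocks/comments/a1", 4.2),
        ("reddit.com/r/stocks/comments/b2", 1.1),
    ],
    "Nvidia stock drop": [
        ("reddit.com/r/stocks/comments/b2", 3.0),
        ("reddit.com/r/wallstreetbets/comments/c3", 2.5),
    ],
}


def merge_results(results: dict[str, list[tuple[str, float]]], max_results: int = 10):
    """Keep each post once, with its best score and the query that produced it."""
    best: dict[str, dict] = {}
    for query, hits in results.items():
        for permalink, score in hits:
            if permalink not in best or score > best[permalink]["score"]:
                best[permalink] = {
                    "url": permalink,
                    "score": score,
                    "matched_query": query,
                }
    # rank the unique posts by score, highest first
    ranked = sorted(best.values(), key=lambda row: row["score"], reverse=True)
    return ranked[:max_results]


merged = merge_results(results_by_query)
for row in merged:
    print(row)
```

Keeping the best score per post, rather than the first one seen, means a post found by several expanded queries is ranked by its strongest match.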

In [17]:
from typing import Any
from langchain_core.documents import Document

@tool
def reddit_stock_mentions(
    queries: list[str], max_results: int = 10, **kwargs
) -> list[dict[str, Any]]:
    """
    Search through Reddit posts from finance subreddits like r/wallstreetbets,
    r/investing, and r/pennystocks to find mentions of stocks, companies, or
    market trends.

    When using this tool:
    - For company searches, include both the company name and ticker symbol if
      known (e.g., "Tesla TSLA")
    - For misspelled company names, provide the correct spelling
    - Consider alternative terms that might be used on Reddit (e.g., "$PLTR"
      instead of just "Palantir")
    - Be specific about the type of information you're looking for (e.g.,
      "NVIDIA earnings reaction")

    Args:
        queries: search queries to find relevant posts about stocks or companies
        max_results: how many posts to return

    Returns:
        A list of relevant Reddit posts with metadata
    """
    try:
        seen_docs = set()  # To track unique documents
        docs: list[tuple[Document, float, str]] = []

        # Perform the search using sparse vectors
        # This is where the magic happens - sparse vectors excel at matching
        # specific terms like stock tickers and company names
        for query in queries:
            results = vector_store.similarity_search_with_score(
                query=query, k=max_results
            )
            # remember which query produced each result
            docs.extend((doc, score, query) for doc, score in results)

        # Sort by score first so deduplication keeps each post's best match
        docs.sort(key=lambda x: x[1], reverse=True)

        unique_docs: list[dict] = []

        # Add unique results
        for doc, score, matched_query in docs:
            # Create a unique identifier for the document
            doc_id = f"{doc.metadata.get('permalink', '')}"

            # Only add if we haven't seen this document before
            if doc_id not in seen_docs:
                seen_docs.add(doc_id)
                unique_docs.append({
                    "content": doc.page_content,
                    "subreddit": doc.metadata.get("subreddit", "unknown"),
                    "score": score,
                    "url": doc.metadata.get("permalink", ""),
                    "matched_query": matched_query,
                })

        # Sort by score (if available) and limit to max_results
        unique_docs.sort(key=lambda x: x.get("score", 0), reverse=True)
        return unique_docs[:max_results]

    except Exception as e:
        return [{"error": f"Error searching Reddit mentions: {str(e)}"}]


## Putting It All Together: The Reddit Stocks Agent

Now we'll create an agent that orchestrates our tools to help users find and analyze
what Reddit is saying about stocks and companies. The agent will:

1. Use query expansion to handle misspellings and terminology variations
2. Leverage sparse embeddings for precise matching of financial terms
3. Analyze and summarize the Reddit sentiment about stocks

This approach combines the best of both worlds:
- LLMs for understanding user intent and generating alternative queries
- Sparse vector search for high-precision retrieval of financial discussions

We start by setting up our system prompt and chat prompt template, which includes placeholders for chat history _and_ intermediate steps where the agent can store information returned from its various tools before responding to the user.

In [18]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define a system prompt that emphasizes using query expansion
system_prompt = (
    "You are a financial research assistant that helps users find information "
    "about stocks and companies from Reddit discussions.\n\n"
    "ALWAYS use the expand_query tool first when searching for information. "
    "This is critical because:\n"
    "1. It helps find alt terms and synonyms that Reddit users might use\n"
    "2. It corrects misspelled company names or stock tickers\n"
    "3. It improves search coverage by generating multiple related queries\n\n"
    "After expanding the query, use the reddit_stock_mentions tool with the "
    "expanded queries to find relevant posts.\n"
    "Remember to analyze the results and provide a comprehensive summary of "
    "what Reddit users are saying about the stock or company in question."
)

# Create the prompt template with message history
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

Now we set up chat memory. We'll use a simple conversation buffer (i.e., save everything).

In [19]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)


Now bring all of this together to create our agent:

In [24]:
from langchain.agents import create_openai_tools_agent, AgentExecutor

main_llm = ChatOpenAI(model="gpt-4.1", temperature=0.0)
tools = [expand_query, reddit_stock_mentions]

agent = create_openai_tools_agent(
    llm=main_llm, tools=tools, prompt=prompt
)

# package into an agent executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
)

## Testing Our Agent with Real-World Queries

Let's put our agent to the test with a realistic query. The agent is designed to handle cases such as:
1. Misspelled company names (Palanteer → Palantir)
2. Discussions about specific events (NVIDIA earnings)
3. Broader topics (biotech companies)

The agent will:
- Expand each query to cover different terminology
- Search using sparse vectors for precise matching
- Analyze and summarize the Reddit sentiment

In [25]:
from IPython.display import Markdown, display

query = "What happened to semiconductor companies' stock recently?"

# run the agent with the query
result = agent_executor.invoke({"input": query})

display(Markdown(result['output']))



> Entering new AgentExecutor chain...

Invoking: `expand_query` with `{'query': 'semiconductor companies stock recent news', 'num_expansions': 6}`


['semiconductor industry stock market updates', "latest news on chip manufacturers' shares", 'recent developments in semiconductor company stocks', 'stock performance news for companies like NVIDIA and AMD', 'market trends in semiconductor stocks such as Intel and Qualcomm', "updates on semiconductor firms' share prices and financial news"]

Invoking: `reddit_stock_mentions` with `{'queries': ['semiconductor industry stock market updates', "latest news on chip manufacturers' shares", 'recent developments in semiconductor company stocks', 'stock performance news for companies like NVIDIA and AMD', 'market trends in semiconductor stocks such as Intel and Qualcomm', "updates on semiconductor firms' share prices and financial news"], 'max_results': 15}`


[{'content': "##

Here's a summary of what Reddit users are saying about the recent performance of semiconductor companies' stocks:

1. Tariffs and Trade Tensions: There is significant concern about the impact of renewed tariffs and trade tensions, especially between the US and China. Intel's CFO recently stated that tariffs have increased the likelihood of an economic slowdown or recession, and this uncertainty has led to a wider-than-normal revenue forecast range for Intel. The company’s stock dropped more than 5% after its latest earnings call, despite beating expectations, due to cautious guidance and tariff-related risks.

2. Stock Performance: The overall sentiment is cautious to negative for semiconductor stocks in the short term. Intel’s stock specifically took a hit after its earnings, and there is a general sense that the sector is facing headwinds from macroeconomic uncertainty, trade policy, and potential slowdowns in demand.

3. Broader Market Impact: Some posts mention that, aside from a few exceptions like NVIDIA, most stocks (including semiconductors) are struggling to find sustainable catalysts for upward movement. The market is seen as having more downside risk than upside at the moment, with many investors waiting for further clarity after Q2 and Q3 earnings.

4. Demand and Inventory: Intel noted that some of its recent sales were driven by customers stockpiling chips ahead of tariffs, which could mean softer demand in future quarters as inventories are worked through.

5. Company-Specific News: While most discussion is about large players like Intel, NVIDIA, and AMD, there is also mention of smaller tech and semiconductor-related companies making moves, such as SS Innovations uplisting to Nasdaq, but this is more about company milestones than sector-wide trends.

In summary, Reddit users are currently wary of semiconductor stocks due to trade policy risks, macroeconomic uncertainty, and cautious corporate guidance. Intel’s recent earnings and commentary have amplified concerns about the sector’s near-term outlook, and most investors are taking a wait-and-see approach until more data comes in from upcoming earnings reports.

## How does our search perform without query expansion?

Let's see what we'd be missing out on if we weren't using query expansion in our search.

We'll set up another agent that doesn't have access to the query expansion tool.

In [28]:
memory2 = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# this time initialize with ONLY the reddit_stock_mentions
agent_wo_expansion = create_openai_tools_agent(
    llm=main_llm, tools=[reddit_stock_mentions], prompt=prompt
)

# package into another agent executor
agent_executor = AgentExecutor(
    agent=agent_wo_expansion,
    tools=[reddit_stock_mentions],
    memory=memory2,
    verbose=True,
)

Now let's try the same query _without_ the query expansion tool _and_ while telling the agent to use just a single query:

In [30]:
result = agent_executor.invoke({"input": query + " just use one search query"})

display(Markdown(result['output']))



> Entering new AgentExecutor chain...

Invoking: `reddit_stock_mentions` with `{'queries': ['semiconductor stocks recent performance'], 'max_results': 10}`


[{'content': '## USAS : Q1 Earnings Estimate for Americas Silver Issued By Cormark\n\nAmericas Silver Corp ([NYSEAMERICAN:USAS](https://www.marketbeat.com/stocks/NYSEAMERICAN/USAS/)\xa0–\xa0[Free Report](https://www.marketbeat.com/arnreports/ReportTickerOptin.aspx?RegistrationCode=TickerHyperlink&amp;Prefix=NYSEAMERICAN&amp;Symbol=USAS)) – Investment analysts at Cormark issued their Q1 2025 earnings per share (EPS) estimates for shares of Americas Silver in a research report issued to clients and investors on Tuesday, April 22nd. Cormark analyst N. Dion expects that the company will post earnings per share of $0.00 for the quarter. Cormark currently has a “Moderate Buy” rating on the stock. The consensus estimate for Americas Silver’s current full-year earnings is ($0.11) per share.\xa0\n\nS

Based on recent Reddit discussions using the query "semiconductor stocks recent performance," there is very little direct conversation specifically about semiconductor stocks' recent movements. Most of the top posts are focused on other sectors (such as gold and mining stocks) or general market sentiment, with some discussion about volatility and retail investors buying dips.

This suggests that while the broader market is experiencing volatility and sector rotation (with some investors moving from tech/semiconductors to perceived safer assets like gold), there hasn't been a major, focused discussion on semiconductor stocks' recent performance in the most active Reddit threads. The lack of direct posts may indicate that, at least in the last few days, semiconductor stocks have not been the primary topic of retail investor attention compared to other sectors.

If you want a more detailed or company-specific update, let me know which semiconductor companies or tickers you're interested in!
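
One plausible reading of the gap between the two runs is that sparse retrieval rewards lexical overlap, so a single generic query can miss posts that only name specific companies or tickers. The toy sketch below illustrates that effect with plain keyword matching; the posts, the stopword list, and the expanded queries are all invented, and this is not the scoring Pinecone uses.

```python
# Toy posts, invented for illustration.
posts_toy = {
    "p1": "intel stock dropped after earnings on tariff fears",
    "p2": "nvda and amd shares slide as chip demand cools",
    "p3": "gold miners rally while the market wobbles",
}

# Generic words that should not count as a match on their own.
STOPWORDS = {"the", "and", "as", "on", "after", "while",
             "stock", "stocks", "shares", "market"}


def keyword_search(query: str) -> set[str]:
    """Return ids of posts sharing at least one non-stopword token with the query."""
    terms = set(query.lower().split()) - STOPWORDS
    return {pid for pid, text in posts_toy.items() if terms & set(text.split())}


single = keyword_search("semiconductor stocks recent performance")

expanded_queries = [
    "semiconductor stocks recent performance",
    "Intel stock performance",
    "NVDA AMD chip shares",
]
expanded = set().union(*(keyword_search(q) for q in expanded_queries))

print("single query:", sorted(single))
print("expanded queries:", sorted(expanded))
```

None of the toy posts contain the word "semiconductor", so the single query finds nothing, while the expanded queries reach the posts that mention Intel, NVDA, and AMD by name.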

Once we're done, we can delete all records from the index to save resources (to remove the index itself, call `pc.delete_index(name=index_name)` instead):

In [31]:
index.delete(delete_all=True)

{}

---