# Leveraging Sparse Embeddings for Financial Reddit Analysis

In this example, we'll build an agent that can search through Reddit posts about stocks and financial markets
using sparse embeddings. Unlike dense embeddings that capture overall semantic meaning, sparse embeddings excel at 
capturing specific keywords and rare terms - perfect for financial contexts where stock tickers (like AAPL, TSLA) 
and company names need to be precisely matched.

## Why Sparse Embeddings for Financial Data?

Financial discussions on platforms like Reddit often contain:
1. Stock tickers (PLTR, NVDA) that need exact matching
2. Company names that might be misspelled (Palanteer vs. Palantir)
3. Industry-specific jargon that dense embeddings might miss

Sparse embeddings shine here because they:
- Maintain high precision for exact term matches (crucial for tickers)
- Preserve term frequency information (how often a stock is mentioned)
- Allow for better retrieval of documents with rare but important terms
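
To make the exact-match behaviour concrete, here is a minimal sketch that uses plain term-frequency counts as a stand-in for a sparse embedding. The helper names (`sparse_vector`, `dot`) and the three sample documents are made up for illustration; a learned sparse model assigns weights differently, but the scoring idea is the same.

```python
from collections import Counter


def sparse_vector(text: str) -> Counter:
    """Toy sparse vector: token -> term frequency (all other terms are implicitly zero)."""
    return Counter(text.lower().split())


def dot(a: Counter, b: Counter) -> int:
    """Dot product over the tokens that both sparse vectors share."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


# Hypothetical mini-corpus, not taken from the dataset
docs = [
    "pltr earnings beat expectations and pltr guidance raised",
    "nvidia shares fell after the earnings call",
    "gold miners rally as gold hits record",
]

query = sparse_vector("pltr earnings")
scores = [dot(query, sparse_vector(doc)) for doc in docs]
print(scores)  # [3, 1, 0]
```

The first document wins because it mentions the ticker twice and also contains "earnings"; the third shares no terms with the query and scores zero.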

Let's see how we can implement this approach with Pinecone's sparse vector database.

In [1]:
import os
import ast
import json
from typing import Any, Dict, List

from langchain_core.tools import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from langchain_pinecone import PineconeSparseVectorStore
from langchain_pinecone.embeddings import PineconeSparseEmbeddings
from pinecone import Pinecone


# Load the Reddit posts from the JSON file
with open('data/finance_investing_data.json', 'r', encoding='utf-8') as f:
    reddit_data = json.load(f)

print(f"Loaded {len(reddit_data)} posts from data.json")

# Prepare the posts for indexing
posts = []
for item in reddit_data:
    # Extract the relevant fields from each record.
    # Assumes each record has the keys: id, title, selftext, subreddit, body
    posts.append({
        "id": item.get("id", ""),
        "title": item.get("title", ""),
        "content": item.get("selftext") or "",  # Treat a missing or null selftext as empty
        "subreddit": item.get("subreddit", ""),
        "score": 0,  # Score not available in this dataset
        "created_utc": "",  # Timestamp not available in this dataset
        "permalink": f"https://reddit.com/r/{item.get('subreddit', 'all')}/comments/{item.get('id', '')}/"
    })


Loaded 107 posts from data.json


In [2]:
posts[:5]

[{'id': '1j0w73o',
  'title': 'Rate My Portfolio - r/Stocks Quarterly Thread March 2025',
  'content': "Please use this thread to discuss your portfolio, learn of other stock tickers &amp; portfolios like [Warren Buffet's](https://buffett.online/en/portfolio/), and help out users by giving constructive criticism.\n\nWhy quarterly?  Public companies report earnings quarterly; many investors take this as an opportunity to rebalance their portfolios.  We highly recommend you do some reading:  Check out our wiki's list of [relevant posts &amp; book recommendations.](https://www.reddit.com/r/stocks/wiki/index/#wiki_relevant_posts.2C_books.2C_wiki_recommendations)\n\nYou can find stocks on your own by using a scanner like your broker's or [Finviz.](https://finviz.com/screener.ashx)  To help further, here's a list of [relevant websites.](https://www.reddit.com/r/stocks/wiki/index/#wiki_relevant_websites.2Fapps)\n\nIf you don't have a broker yet, see our [list of brokers](https://www.reddit.co

## Sparse Vector Index Setup

For our Reddit stocks analysis, we need a specialized vector index that supports sparse vectors.
Unlike dense vectors (which might be 1024 dimensions with most values being non-zero),
sparse vectors can have millions of dimensions but only a few non-zero values.

This sparse representation is perfect for capturing the presence of specific terms like "NVDA" or "earnings call"
without diluting their importance in a dense representation.
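
As a rough illustration of that layout, the sketch below stores only the non-zero entries of a vector as parallel `indices` and `values` lists and scores two vectors with a dot product, the same metric the index in this example is created with. The helper names and the 10-dimensional toy vectors are assumptions for illustration only; real sparse vectors have a far larger index space.

```python
from typing import Dict, List

SparseVector = Dict[str, list]


def to_sparse(dense: List[float]) -> SparseVector:
    """Keep only the non-zero entries as parallel index/value lists."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    values = [dense[i] for i in indices]
    return {"indices": indices, "values": values}


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product that only touches dimensions present in both vectors."""
    lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


# Toy 10-dimensional vectors with two non-zero entries each
doc = [0.0] * 10
doc[3], doc[7] = 1.5, 0.5

query = [0.0] * 10
query[3], query[9] = 2.0, 1.0

print(to_sparse(doc))                                # {'indices': [3, 7], 'values': [1.5, 0.5]}
print(sparse_dot(to_sparse(query), to_sparse(doc)))  # 3.0
```

Only dimension 3 is shared, so it alone contributes to the score; every other dimension is skipped rather than multiplied by zero.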

## Sparse Embeddings Model

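We'll use Pinecone's hosted sparse embedding model, `pinecone-sparse-english-v0`. It is a general-purpose English model rather than a finance-specific one, but its output suits this task because it:
1. Assigns weights to individual terms, including rare ones such as stock tickers
2. Produces sparse vector representations that emphasize these key terms
3. Enables high-precision matching for keyword-heavy financial queries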

In [3]:
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "your-api-key")
PINECONE_INDEX_NAME = "reddit-stocks-index"
NAMESPACE = "finance-subreddits"

pc = Pinecone(api_key=PINECONE_API_KEY)

if pc.has_index(PINECONE_INDEX_NAME):
    # Reuse the existing index, but only if it was created as a sparse index
    index_info = pc.describe_index(PINECONE_INDEX_NAME)
    if index_info.vector_type != "sparse":
        raise ValueError(
            f"Index {PINECONE_INDEX_NAME} exists but is not a sparse index"
        )
else:
    # Sparse indexes take no fixed dimension and are scored with dot product
    print(f"Index {PINECONE_INDEX_NAME} not found; creating a sparse index")
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        metric="dotproduct",
        spec={"serverless": {"cloud": "aws", "region": "us-east-1"}},
        vector_type="sparse",
    )

# Initialize the sparse embeddings model
sparse_embeddings = PineconeSparseEmbeddings(
    model="pinecone-sparse-english-v0", show_progress_bar=True
)

# Initialize the vector store
index = pc.Index(PINECONE_INDEX_NAME)
vector_store = PineconeSparseVectorStore(
    index=index, embedding=sparse_embeddings, namespace=NAMESPACE
)

In [20]:
# Index the posts
texts = [f"Title: {post['title']}\nContent: {post['content']}" for post in posts]
metadatas = [{
    "subreddit": post["subreddit"],
    "score": post["score"],
    "created_utc": post["created_utc"],
    "permalink": post["permalink"]
} for post in posts]

# Add texts in batches to avoid memory issues
batch_size = 100
for i in range(0, len(texts), batch_size):
    end_idx = min(i + batch_size, len(texts))
    vector_store.add_texts(
        texts=texts[i:end_idx],
        metadatas=metadatas[i:end_idx],
        namespace=NAMESPACE
    )
    print(f"Indexed batch {i//batch_size + 1}/{(len(texts) + batch_size - 1)//batch_size}")

print("Indexing complete.")

Indexed batch 1/2
Indexed batch 2/2
Indexing complete.


## Query Expansion: Bridging ticker and company names for financial search

One of the biggest challenges in financial search is the vocabulary mismatch problem:
- Users might search for "Palanteer" when they mean "Palantir" (PLTR)
- They might use "NVDA stock drop" when relevant posts mention "Nvidia shares plummeting"
- They might search for a company name when Reddit posts use only the ticker symbol

To solve this, we'll implement query expansion - a technique that transforms a single query 
into multiple alternative queries that capture the same intent but with different terminology.
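
Before bringing in an LLM, the idea can be shown with a deterministic sketch. The `ALIASES` table and `expand_with_aliases` helper below are hypothetical and exist only to illustrate the input and output of query expansion; the tool in this example asks an LLM to generate the alternatives instead of using a fixed table.

```python
from typing import Dict, List

# Hypothetical alias table: lower-cased token -> terms that might appear in posts
ALIASES: Dict[str, List[str]] = {
    "palantir": ["PLTR", "Palantir"],
    "palanteer": ["PLTR", "Palantir"],  # common misspelling
    "nvda": ["NVDA", "Nvidia"],
    "nvidia": ["NVDA", "Nvidia"],
}


def expand_with_aliases(query: str) -> List[str]:
    """Return the original query plus one variant per known alias of each token."""
    tokens = query.split()
    expansions = [query]
    for token in tokens:
        for alias in ALIASES.get(token.lower(), []):
            candidate = " ".join(alias if t == token else t for t in tokens)
            if candidate not in expansions:
                expansions.append(candidate)
    return expansions


print(expand_with_aliases("Palanteer stock drop"))
# ['Palanteer stock drop', 'PLTR stock drop', 'Palantir stock drop']
```

A fixed table only covers names someone thought to list, which is why the implementation here delegates the expansion to an LLM.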

In [16]:
@tool
def expand_query(query: str, num_expansions: int = 3) -> List[str]:
    """
    Expand a search query into multiple alternative queries to improve search results.
    This is useful when:
    - The original query is vague or ambiguous
    - You need to find synonyms or related terms
    - You want to cover different aspects of the same topic
    - The query contains misspelled company names or stock tickers
    
    For example, a query like "Tesla stock drop" could be expanded to:
    - "TSLA share price decrease"
    - "Tesla Motors stock market decline"
    - "Elon Musk company stock performance"
    
    Args:
        query: The original query that needs to be expanded
        num_expansions: Number of alternative queries to generate
    
    Returns:
        A list of expanded queries including the original query
    """
    try:
        # Create the LLM for query expansion
        expansion_llm = ChatOpenAI(temperature=0.7)
        
        # Define the prompt template
        prompt = """
        You are an expert at expanding search queries to improve search results.
        
        Given a query about stocks, companies, or financial markets, generate {num_expansions} alternative 
        queries that capture the same intent but use different wording, synonyms, or related concepts.
        
        For company names, include both the company name and ticker symbol when possible.
        For misspelled company names, include the correct spelling in the alternatives.
        For general topics, e.g. "tech company performance", expand these queries to specific companies you know of, like "Google", "Apple", "Nvidia".

        For example, a query about Google could expand to: ["GOOGL", "Alphabet"]
        Another example, a query about FAANG company performance could expand to: ["META performance", "GOOGL performance", "AAPL performance"]
        and so on.
        
        Original query: {query}
        
        Return ONLY a Python list of strings with the alternative queries. 
        Do not include any explanations or additional text.
        """
        
        # Get response from LLM
        response = expansion_llm.invoke(prompt.format(
            query=query, 
            num_expansions=num_expansions
        ))
        
        # Extract the list of alternative queries from the response
        response_text = response.content
        
        # Clean up the response to make it a valid Python list
        if "```" in response_text:
            response_text = response_text.split("```")[1]
            if response_text.startswith("python"):
                response_text = response_text[6:]
        
        # Safely evaluate the string as a Python expression
        try:
            expanded_queries = ast.literal_eval(response_text.strip())
            if not isinstance(expanded_queries, list):
                expanded_queries = [response_text.strip()]
        except (ValueError, SyntaxError):
            # Fallback if parsing fails
            expanded_queries = [line.strip() for line in response_text.strip().split('\n') 
                               if line.strip() and not line.strip().startswith('[') and not line.strip().endswith(']')]
        
        # Add the original query to the list
        expanded_queries.append(query)
        
        return expanded_queries
        
    except Exception as e:
        # Fall back to the original query so an error message is never used as a search term
        print(f"Error expanding query: {e}")
        return [query]
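
The response-parsing step is the most fragile part of this tool, because the LLM may return a Python list, a JSON array, a fenced block, or a bulleted list. Below is a standalone sketch of a more defensive parser that can be tested without calling an LLM. The function name `parse_query_list` is an assumption for illustration and is not part of LangChain.

```python
import ast
import json
from typing import List

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def parse_query_list(response_text: str) -> List[str]:
    """Best-effort extraction of a list of query strings from an LLM response."""
    text = response_text.strip()

    # Keep only the contents of the first fenced block, if there is one
    if FENCE in text:
        text = text.split(FENCE)[1]
        first_line, _, rest = text.partition("\n")
        if first_line.strip().isalpha():  # language tag such as "python" or "json"
            text = rest
    text = text.strip()

    # Try strict JSON first, then a Python literal (handles single-quoted strings)
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(text)
        except (ValueError, SyntaxError):
            continue
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed if str(item).strip()]

    # Fallback: treat each non-empty line as one query, dropping bullets and quotes
    lines = (line.strip(" -*\"'[],") for line in text.splitlines())
    return [line for line in lines if line]


print(parse_query_list('["NVDA", "Nvidia"]'))                     # ['NVDA', 'Nvidia']
print(parse_query_list("- TSLA share price\n- Tesla decline"))    # ['TSLA share price', 'Tesla decline']
```

Both `json.loads` and `ast.literal_eval` only parse literals, so neither executes code from the model's response.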

## Searching Reddit with Sparse Vectors

Now that we have our query expansion mechanism, let's create a tool that leverages both 
sparse embeddings and query expansion to search through Reddit posts about stocks.

The two tools work in combination:
1. The agent calls `expand_query` to generate multiple versions of the search query
2. Each expanded query is passed to the search tool and converted to a sparse vector representation
3. Pinecone's sparse similarity search finds the most relevant posts for that query
4. Within each search, results are deduplicated by permalink and returned in relevance order
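
The expand-then-search flow can be sketched independently of Pinecone. In the sketch below, `search_expanded`, `fake_search`, and `FAKE_INDEX` are hypothetical names, and the search function is assumed to return `(doc_id, score)` pairs. Scores from different queries are not strictly comparable, so keeping the best score per document is a simplification.

```python
from typing import Callable, Dict, List, Tuple

SearchFn = Callable[[str], List[Tuple[str, float]]]


def search_expanded(queries: List[str], search_fn: SearchFn, max_results: int = 5) -> List[dict]:
    """Run every query, keep each document's best score, and rank by that score."""
    best: Dict[str, Tuple[float, str]] = {}
    for query in queries:
        for doc_id, score in search_fn(query):
            # Deduplicate: remember only the highest-scoring hit per document
            if doc_id not in best or score > best[doc_id][0]:
                best[doc_id] = (score, query)

    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [
        {"id": doc_id, "score": score, "matched_query": query}
        for doc_id, (score, query) in ranked[:max_results]
    ]


# Stand-in for a vector store: query -> list of (doc_id, score) hits
FAKE_INDEX = {
    "Nvidia stock": [("post_a", 0.9), ("post_b", 0.4)],
    "NVDA shares": [("post_b", 0.7), ("post_c", 0.2)],
}


def fake_search(query: str) -> List[Tuple[str, float]]:
    return FAKE_INDEX.get(query, [])


results = search_expanded(["Nvidia stock", "NVDA shares"], fake_search, max_results=2)
print([r["id"] for r in results])  # ['post_a', 'post_b']
```

Here `post_b` is returned by both queries but appears once, credited to the query that scored it higher.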

In [17]:
@tool
def reddit_stock_mentions(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """
    Search through Reddit posts from finance subreddits like r/wallstreetbets, r/investing, and r/pennystocks
    to find mentions of stocks, companies, or market trends. This tool is useful for finding what retail
    investors are discussing about specific companies or market events.
    
    When using this tool:
    - For company searches, include both the company name and ticker symbol if known (e.g., "Tesla TSLA")
    - For misspelled company names, provide the correct spelling
    - Consider alternative terms that might be used on Reddit (e.g., "$PLTR" instead of just "Palantir")
    - Be specific about the type of information you're looking for (e.g., "NVIDIA earnings reaction")
    
    NOTE: This tool searches with the query exactly as given. To broaden coverage, call
    expand_query first and run this tool once for each expanded query.
    
    Args:
        query: The search query to find relevant Reddit posts about stocks or companies
        max_results: Maximum number of Reddit posts to return
    
    Returns:
        A list of relevant Reddit posts with metadata
    """
    try:
        all_results = []
        seen_docs = set()  # To track unique documents
        
        # Perform the search using sparse vectors.
        # Sparse vectors excel at matching specific terms like stock tickers
        # and company names with high precision.
        docs = vector_store.similarity_search(query=query, k=max_results)
        
        # Add unique results
        for doc in docs:
            # Create a unique identifier for the document
            doc_id = f"{doc.metadata.get('permalink', '')}"
            
            # Only add if we haven't seen this document before
            if doc_id not in seen_docs:
                seen_docs.add(doc_id)
                all_results.append({
                    "content": doc.page_content,
                    "subreddit": doc.metadata.get("subreddit", "unknown"),
                    "score": doc.metadata.get("score", 0),
                    "date": doc.metadata.get("created_utc", "unknown"),
                    "url": doc.metadata.get("permalink", ""),
                    "matched_query": query,
                })
        
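        # Sort by the post's Reddit score from metadata and limit to max_results.
        # Every score is 0 in this dataset, so the stable sort keeps Pinecone's relevance order.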
        all_results.sort(key=lambda x: x.get("score", 0), reverse=True)
        return all_results[:max_results]
        
    except Exception as e:
        return [{"error": f"Error searching Reddit mentions: {str(e)}"}]


## Putting It All Together: The Reddit Stocks Agent

Now we'll create an agent that orchestrates our tools to help users find and analyze
what Reddit is saying about stocks and companies. The agent will:

1. Use query expansion to handle misspellings and terminology variations
2. Leverage sparse embeddings for precise matching of financial terms
3. Analyze and summarize the Reddit sentiment about stocks

This approach combines the best of both worlds:
- LLMs for understanding user intent and generating alternative queries
- Sparse vector search for high-precision retrieval of financial discussions

In [18]:
# Initialize the LLM
llm = ChatOpenAI()

# Define the tools list
tools = [expand_query, reddit_stock_mentions]

# Define a system prompt that emphasizes using query expansion
system_prompt = """You are a financial research assistant that helps users find information about stocks and companies from Reddit discussions.

ALWAYS use the expand_query tool first when searching for information. This is critical because:
1. It helps find alternative terms and synonyms that Reddit users might use
2. It corrects misspelled company names or stock tickers
3. It improves search coverage by generating multiple related queries

After expanding the query, use the reddit_stock_mentions tool with the expanded queries to find relevant posts.

Remember to analyze the results and provide a comprehensive summary of what Reddit users are saying about the stock or company in question.
"""

# Create the prompt template with message history
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

# Create a memory for the agent
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Create the agent using the create_tool_calling_agent approach
from langchain.agents import create_tool_calling_agent, AgentExecutor

agent = create_tool_calling_agent(
    llm=llm,
    tools=tools,
    prompt=prompt
)

# Create the agent executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
    handle_parsing_errors=True,
)

## Testing Our Agent with Real-World Queries

Let's put our agent to the test with a realistic broad-topic query about semiconductor companies. The agent is designed to handle queries that involve:
1. Misspelled company names (Palanteer → Palantir)
2. Discussions about specific events (NVIDIA earnings)
3. Broader topics (biotech companies)

The agent will:
- Expand each query to cover different terminology
- Search using sparse vectors for precise matching
- Analyze and summarize the Reddit sentiment

In [21]:
query = "What happened to semiconductor companies' stock recently?"

print(f"\n\n===== QUERY: {query} =====\n")

# Run the agent with the query
# The executor's memory supplies chat_history, so only the input is passed here
result = agent_executor.invoke({"input": query})

# Print the result
print(f"RESULT: {result['output']}")



===== QUERY: What happened to semiconductor companies' stock recently? =====



> Entering new AgentExecutor chain...

Invoking: `expand_query` with `{'query': 'semiconductor companies stock performance', 'num_expansions': 3}`


['Intel stock performance', 'AMD stock performance', 'Nvidia stock performance', 'semiconductor companies stock performance']
Invoking: `reddit_stock_mentions` with `{'query': 'Intel stock performance', 'max_results': 5}`


[{'content': 'Title: Intel CFO says tariffs increase chance for economic slowdown, recession getting likelier\nContent: Source: [https://www.cnbc.com/2025/04/24/intel-cfo-says-tariffs-raise-chance-for-economic-slowdown-recession.html](https://www.cnbc.com/2025/04/24/intel-cfo-says-tariffs-raise-chance-for-economic-slowdown-recession.html)\n\n&gt;  \nIntel\xa0CFO David Zinsner said\xa0President Donald Trump’s\xa0tariffs and retaliation from other countries has increased

## How does our search perform without query expansion?

Let's see what we'd be missing out on if we weren't using query expansion in our search.

We'll set up another agent that doesn't have access to the query expansion tool and rerun the semiconductor query from above.

In [22]:
# Baseline: the same agent setup, but without access to the expand_query tool
tools_wo_expansion = [reddit_stock_mentions]

agent_wo_expansion = create_tool_calling_agent(
    llm=llm,
    tools=tools_wo_expansion,
    prompt=prompt
)

query = "What happened to semiconductor companies' stock recently?"

# Give the baseline its own fresh memory so the earlier run does not leak into it
AgentExecutor(
    agent=agent_wo_expansion,
    tools=tools_wo_expansion,
    memory=ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True
    ),
    verbose=True,
    handle_parsing_errors=True,
).invoke({"input": query})



> Entering new AgentExecutor chain...

Invoking: `reddit_stock_mentions` with `{'query': 'semiconductor companies stock recent', 'max_results': 10}`


[{'content': "Title: RHCO big news out today.. Finalized ACQUISITION of Morrich Lottery Limited and secured LOTTERY, SPORTSBOOK, and CASINO LICENSES\nContent: This stock has very little volume and its a small float. But looks good with their current financials and this recent completed acquisition to add more to the books.\n\nHere's a quick look at their financials from last quarter:\n\n•$5.7M in revenue last quarter\n\n•Net profitable\n\n•$12.9M in assets\n\n•No convertible notes\n\n•Small float, 24M shares held at DTC (The held at DTC number is the public tradable float)\n\n•Completed acquisition news today, (Finalized ACQUISITION of Morrich Lottery Limited and secured lottery, sportsbook and casino licenses)", 'subreddit': 'pennystocks', 'score': 0.0, 'date': '', 'url': 'https://reddit.com/r/pen

{'input': "What happened to semiconductor companies' stock recently?",
 'chat_history': [HumanMessage(content="What happened to semiconductor companies' stock recently?", additional_kwargs={}, response_metadata={}),
  AIMessage(content="Recent Reddit discussions related to semiconductor companies' stocks did not appear in the search results. However, I found some relevant information from other discussions:\n\n1. In the subreddit r/investing, there was a post discussing gold stocks hitting $3,300/oz. The post highlighted the significant increase in gold prices and the shift of focus towards gold mining companies due to uncertainties in the market. It also provided insights on different types of gold stocks like major producers, junior miners, and streaming companies.\n\n2. Another post from the subreddit r/wallstreetbets discussed Alphabet's (Google) upcoming first-quarter earnings. It mentioned analysts' expectations for revenue growth and operating income, especially focusing on the 

In [10]:
# Clean up

index.delete(delete_all=True, namespace=NAMESPACE)

{}