[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/lexical-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/lexical-search.ipynb)

# Lexical Search

In this notebook, you'll learn how to use Pinecone for lexical (keyword) search.

Lexical search is a form of retrieval that finds the records that most exactly match the words or phrases in a query. It uses sparse vectors, which have a very large number of dimensions but only a small proportion of non-zero values. Each dimension represents a word from a dictionary, and each value represents the contextual importance of that word in the document. Matching words are scored independently and the scores are summed, so the records with the strongest word overlap are scored highest.

You can read more about how sparse retrieval works [here](https://www.pinecone.io/learn/sparse-retrieval/).
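To make the scoring idea concrete, here is a minimal sketch of a sparse dot product in plain Python. The tokens and weights are invented for illustration; a real sparse embedding model produces its own learned weights, and this is not how Pinecone computes scores internally.

```python
def sparse_dot(query, doc):
    """Sum the products of weights for tokens present in both sparse vectors."""
    # Iterate over the smaller vector; tokens missing from either side contribute 0.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)

# Toy sparse vectors (token -> weight); the values are assumptions for this example.
query = {"apple": 2.0, "bond": 1.5}
docs = {
    "vec1": {"apple": 1.0, "bond": 2.0, "2023": 0.5},
    "vec3": {"tesla": 1.5, "options": 1.0},
}

scores = {doc_id: sparse_dot(query, vec) for doc_id, vec in docs.items()}
print(scores)
```

Only `vec1` shares tokens with the query, so it is the only record with a non-zero score.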

While semantic search (over a dense index) allows you to find records that are most similar in meaning and context to a query, lexical search (over a sparse index) lets you search for exact token matches like keywords, acronyms, stock tickers, or even proprietary domain terminology, like a product name.

In the example below, we'll search over financial headlines to find news related to Apple.

## 1. Setup

First, let's install the necessary libraries, define some helper functions, and set the API keys we will need to use in this notebook.

In [1]:
!pip install -qU \
  pinecone~=7.3 \
  pinecone-notebooks==0.1.1


---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

### Helper functions

In [82]:
def print_hits(results):
    for hit in results['result']['hits']:
        print(f"id: {hit['_id']}, score: {round(hit['_score'], 2)} text: {hit['fields']['chunk_text']}")

    if len(results['result']['hits']) == 0:
        print("No results found")

### Get and set the Pinecone API key

We will need a free [Pinecone API key](https://docs.pinecone.io/guides/get-started/quickstart). The code below will either help you sign up for a new Pinecone account or authenticate you. Then it will create a new API key and set it as an environment variable. If you are not running in a Colab environment, it will prompt you to enter the API key and then set it in the environment.

In [3]:
import os
from getpass import getpass

def get_pinecone_api_key():
    """
    Get Pinecone API key from environment variable or prompt user for input.
    Returns the API key as a string.

    Only necessary for notebooks. When using Pinecone yourself,
    you can use environment variables or the like to set your API key.
    """
    api_key = os.environ.get("PINECONE_API_KEY")

    if api_key is None:
        try:
            # Try Colab authentication if available
            from pinecone_notebooks.colab import Authenticate
            Authenticate()
            # If successful, key will now be in environment
            api_key = os.environ.get("PINECONE_API_KEY")
        except ImportError:
            # pinecone_notebooks is not importable (e.g. outside Colab), so prompt for the key
            print("Pinecone API key not found in environment.")
            api_key = getpass("Please enter your Pinecone API key: ")
            # Save to environment for future use in session
            os.environ["PINECONE_API_KEY"] = api_key

    return api_key

api_key = get_pinecone_api_key()

## 2. Create Pinecone index and load data

### Initializing Pinecone

In [75]:
from pinecone import Pinecone

# Initialize client

pc = Pinecone(
    api_key=api_key,
    source_tag="pinecone_examples:docs:lexical_search"
)

### Create a Pinecone index with integrated embedding

Lexical search requires three pieces: a processed data source (chunks, or records in Pinecone), an embedding model, and a vector database.

Integrated embedding lets you create a Pinecone index that is tied to a specific Pinecone-hosted embedding model, so you can upsert and search with plain text and let Pinecone handle the embedding. To learn more about integrated embedding, including what other models are available, see the [integrated embedding overview](https://docs.pinecone.io/guides/get-started/overview#integrated-embedding).


Here, we'll create an index with the [pinecone-sparse-english-v0](https://docs.pinecone.io/models/pinecone-sparse-english-v0) embedding model. We also specify a field map that tells Pinecone which field in our records to embed with this model.

In [77]:

index_name = "lexical-search"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "pinecone-sparse-english-v0",
            "field_map":{"text": "chunk_text"}
        }
    )

# Initialize index client
index = pc.Index(name=index_name)

# View index stats
index.describe_index_stats()

{'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'headlines': {'vector_count': 96}},
 'total_vector_count': 96,
 'vector_type': 'sparse'}
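Because the field map points the model at `chunk_text`, each record needs usable text under that key. A small pre-upsert check along the following lines can catch mismatches early. This is only a sketch: `find_invalid_records` is a hypothetical helper, not part of the Pinecone SDK, and it assumes records shaped like the ones used in this notebook (`_id` plus `chunk_text`).

```python
def find_invalid_records(records, text_field="chunk_text"):
    """Return the ids (or list positions) of records lacking usable text."""
    invalid = []
    for position, record in enumerate(records):
        text = record.get(text_field)
        # Flag records where the mapped field is missing, not a string, or blank.
        if not isinstance(text, str) or not text.strip():
            invalid.append(record.get("_id", position))
    return invalid

# Sample records; the two "bad" entries are invented to exercise the check.
sample = [
    {"_id": "vec1", "chunk_text": "Apple Inc. issued a $10 billion corporate bond in 2023."},
    {"_id": "bad1", "text": "This record uses the wrong field name."},
    {"_id": "bad2", "chunk_text": "   "},
]

invalid_ids = find_invalid_records(sample)
print(invalid_ids)  # ['bad1', 'bad2']
```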

### Creating our dataset

Sparse indexes excel when you need exact token matching and predictable precision. For example:

- searching over terminology-heavy domains where subword tokenization can pose a problem, such as medical, legal, and financial domains
- queries that need precise entity matching, such as proper nouns, names, part numbers, stock tickers, or products that are difficult to put into metadata

The example data set below contains financial headlines.

Note: This data set has only 96 records, the maximum allowed by the embedding model in one batch. If we had more than 96 records, we'd have to [upsert in batches](https://docs.pinecone.io/guides/index-data/upsert-data#upsert-in-batches).
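For a larger data set, the batching mentioned above could be sketched as follows. The `batched` helper and the `many_records` list are hypothetical and exist only for illustration; the upsert call is left as a comment because it needs a live index.

```python
def batched(records, batch_size=96):
    """Yield successive slices containing at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Hypothetical larger data set, generated only to demonstrate the batching.
many_records = [
    {"_id": f"rec{i}", "chunk_text": f"Example headline number {i}"}
    for i in range(200)
]

batch_sizes = [len(batch) for batch in batched(many_records)]
print(batch_sizes)  # [96, 96, 8]

# With a live index, each batch could then be upserted in turn:
# for batch in batched(many_records):
#     index.upsert_records(namespace="headlines", records=batch)
```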


In [78]:
# Financial headlines
data = [
    { "_id": "vec1", "chunk_text": "Apple Inc. issued a $10 billion corporate bond in 2023." },
    { "_id": "vec2", "chunk_text": "ETFs tracking the S&P 500 outperformed active funds last year." },
    { "_id": "vec3", "chunk_text": "Tesla's options volume surged after the latest earnings report." },
    { "_id": "vec4", "chunk_text": "Dividend aristocrats are known for consistently raising payouts." },
    { "_id": "vec5", "chunk_text": "The Federal Reserve raised interest rates by 0.25% to curb inflation." },
    { "_id": "vec6", "chunk_text": "Unemployment hit a record low of 3.7% in Q4 of 2024." },
    { "_id": "vec7", "chunk_text": "The CPI index rose by 6% in July 2024, raising concerns about purchasing power." },
    { "_id": "vec8", "chunk_text": "GDP growth in emerging markets outpaced developed economies." },
    { "_id": "vec9", "chunk_text": "Amazon's acquisition of MGM Studios was valued at $8.45 billion." },
    { "_id": "vec10", "chunk_text": "Alphabet reported a 20% increase in advertising revenue." },
    { "_id": "vec11", "chunk_text": "ExxonMobil announced a special dividend after record profits." },
    { "_id": "vec12", "chunk_text": "Tesla plans a 3-for-1 stock split to attract retail investors." },
    { "_id": "vec13", "chunk_text": "Credit card APRs reached an all-time high of 22.8% in 2024." },
    { "_id": "vec14", "chunk_text": "A 529 college savings plan offers tax advantages for education." },
    { "_id": "vec15", "chunk_text": "Emergency savings should ideally cover 6 months of expenses." },
    { "_id": "vec16", "chunk_text": "The average mortgage rate rose to 7.1% in December." },
    { "_id": "vec17", "chunk_text": "The SEC fined a hedge fund $50 million for insider trading." },
    { "_id": "vec18", "chunk_text": "New ESG regulations require companies to disclose climate risks." },
    { "_id": "vec19", "chunk_text": "The IRS introduced a new tax bracket for high earners." },
    { "_id": "vec20", "chunk_text": "Compliance with GDPR is mandatory for companies operating in Europe." },
    { "_id": "vec21", "chunk_text": "What are the best-performing green bonds in a rising rate environment?" },
    { "_id": "vec22", "chunk_text": "How does inflation impact the real yield of Treasury bonds?" },
    { "_id": "vec23", "chunk_text": "Top SPAC mergers in the technology sector for 2024." },
    { "_id": "vec24", "chunk_text": "Are stablecoins a viable hedge against currency devaluation?" },
    { "_id": "vec25", "chunk_text": "Comparison of Roth IRA vs 401(k) for high-income earners." },
    { "_id": "vec26", "chunk_text": "Stock splits and their effect on investor sentiment." },
    { "_id": "vec27", "chunk_text": "Tech IPOs that disappointed in their first year." },
    { "_id": "vec28", "chunk_text": "Impact of interest rate hikes on bank stocks." },
    { "_id": "vec29", "chunk_text": "Growth vs. value investing strategies in 2024." },
    { "_id": "vec30", "chunk_text": "The role of artificial intelligence in quantitative trading." },
    { "_id": "vec31", "chunk_text": "What are the implications of quantitative tightening on equities?" },
    { "_id": "vec32", "chunk_text": "How does compounding interest affect long-term investments?" },
    { "_id": "vec33", "chunk_text": "What are the best assets to hedge against inflation?" },
    { "_id": "vec34", "chunk_text": "Apple will spend more than $500 billion in the U.S. over the next four years" },
    { "_id": "vec35", "chunk_text": "Unemployment hit at 2.4% in Q3 of 2024." },
    { "_id": "vec36", "chunk_text": "Unemployment is expected to hit 2.5% in Q3 of 2024." },
    { "_id": "vec37", "chunk_text": "In Q3 2025 unemployment for the prior year was revised to 2.2%"},
    { "_id": "vec38", "chunk_text": "Emerging markets witnessed increased foreign direct investment as global interest rates stabilized." },
    { "_id": "vec39", "chunk_text": "The rise in energy prices significantly impacted inflation trends during the first half of 2024." },
    { "_id": "vec40", "chunk_text": "Labor market trends show a declining participation rate despite record low unemployment in 2024." },
    { "_id": "vec41", "chunk_text": "Forecasts of global supply chain disruptions eased in late 2024, but consumer prices remained elevated due to persistent demand." },
    { "_id": "vec42", "chunk_text": "Tech sector layoffs in Q3 2024 have reshaped hiring trends across high-growth industries." },
    { "_id": "vec43", "chunk_text": "The U.S. dollar weakened against a basket of currencies as the global economy adjusted to shifting trade balances." },
    { "_id": "vec44", "chunk_text": "Central banks worldwide increased gold reserves to hedge against geopolitical and economic instability." },
    { "_id": "vec45", "chunk_text": "Corporate earnings in Q4 2024 were largely impacted by rising raw material costs and currency fluctuations." },
    { "_id": "vec46", "chunk_text": "Economic recovery in Q2 2024 relied heavily on government spending in infrastructure and green energy projects." },
    { "_id": "vec47", "chunk_text": "The housing market saw a rebound in late 2024, driven by falling mortgage rates and pent-up demand." },
    { "_id": "vec48", "chunk_text": "Wage growth outpaced inflation for the first time in years, signaling improved purchasing power in 2024." },
    { "_id": "vec49", "chunk_text": "China's economic growth in 2024 slowed to its lowest level in decades due to structural reforms and weak exports." },
    { "_id": "vec50", "chunk_text": "AI-driven automation in the manufacturing sector boosted productivity but raised concerns about job displacement." },
    { "_id": "vec51", "chunk_text": "The European Union introduced new fiscal policies in 2024 aimed at reducing public debt without stifling growth." },
    { "_id": "vec52", "chunk_text": "Record-breaking weather events in early 2024 have highlighted the growing economic impact of climate change." },
    { "_id": "vec53", "chunk_text": "Cryptocurrencies faced regulatory scrutiny in 2024, leading to volatility and reduced market capitalization." },
    { "_id": "vec54", "chunk_text": "The global tourism sector showed signs of recovery in late 2024 after years of pandemic-related setbacks." },
    { "_id": "vec55", "chunk_text": "Trade tensions between the U.S. and China escalated in 2024, impacting global supply chains and investment flows." },
    { "_id": "vec56", "chunk_text": "Consumer confidence indices remained resilient in Q2 2024 despite fears of an impending recession." },
    { "_id": "vec57", "chunk_text": "Startups in 2024 faced tighter funding conditions as venture capitalists focused on profitability over growth." },
    { "_id": "vec58", "chunk_text": "Oil production cuts in Q1 2024 by OPEC nations drove prices higher, influencing global energy policies." },
    { "_id": "vec59", "chunk_text": "The adoption of digital currencies by central banks increased in 2024, reshaping monetary policy frameworks." },
    { "_id": "vec60", "chunk_text": "Healthcare spending in 2024 surged as governments expanded access to preventive care and pandemic preparedness." },
    { "_id": "vec61", "chunk_text": "The World Bank reported declining poverty rates globally, but regional disparities persisted." },
    { "_id": "vec62", "chunk_text": "Private equity activity in 2024 focused on renewable energy and technology sectors amid shifting investor priorities." },
    { "_id": "vec63", "chunk_text": "Population aging emerged as a critical economic issue in 2024, especially in advanced economies." },
    { "_id": "vec64", "chunk_text": "Rising commodity prices in 2024 strained emerging markets dependent on imports of raw materials." },
    { "_id": "vec65", "chunk_text": "The global shipping industry experienced declining freight rates in 2024 due to overcapacity and reduced demand." },
    { "_id": "vec66", "chunk_text": "Bank lending to small and medium-sized enterprises surged in 2024 as governments incentivized entrepreneurship." },
    { "_id": "vec67", "chunk_text": "Renewable energy projects accounted for a record share of global infrastructure investment in 2024." },
    { "_id": "vec68", "chunk_text": "Cybersecurity spending reached new highs in 2024, reflecting the growing threat of digital attacks on infrastructure." },
    { "_id": "vec69", "chunk_text": "The agricultural sector faced challenges in 2024 due to extreme weather and rising input costs." },
    { "_id": "vec70", "chunk_text": "Consumer spending patterns shifted in 2024, with a greater focus on experiences over goods." },
    { "_id": "vec71", "chunk_text": "The economic impact of the 2008 financial crisis was mitigated by quantitative easing policies." },
    { "_id": "vec72", "chunk_text": "In early 2024, global GDP growth slowed, driven by weaker exports in Asia and Europe." },
    { "_id": "vec73", "chunk_text": "The historical relationship between inflation and unemployment is explained by the Phillips Curve." },
    { "_id": "vec74", "chunk_text": "Apple prices first bond offering in 2 years" },
    { "_id": "vec75", "chunk_text": "The collapse of Silicon Valley Bank raised questions about regulatory oversight in 2024." },
    { "_id": "vec76", "chunk_text": "The cost of living crisis has been exacerbated by stagnant wage growth and rising inflation." },
    { "_id": "vec77", "chunk_text": "Supply chain resilience became a top priority for multinational corporations in 2024." },
    { "_id": "vec78", "chunk_text": "Consumer sentiment surveys in 2024 reflected optimism despite high interest rates." },
    { "_id": "vec79", "chunk_text": "The resurgence of industrial policy in Q1 2024 focused on decoupling critical supply chains." },
    { "_id": "vec80", "chunk_text": "Technological innovation in the fintech sector disrupted traditional banking in 2024." },
    { "_id": "vec81", "chunk_text": "Apple to pay $95 million to settle lawsuit accusing Siri of eavesdropping." },
    { "_id": "vec82", "chunk_text": "Renewable energy subsidies in 2024 reduced the global reliance on fossil fuels." },
    { "_id": "vec83", "chunk_text": "The economic fallout of geopolitical tensions was evident in rising defense budgets worldwide." },
    { "_id": "vec84", "chunk_text": "The IMF's 2024 global outlook highlighted risks of stagflation in emerging markets." },
    { "_id": "vec85", "chunk_text": "Declining birth rates in advanced economies pose long-term challenges for labor markets." },
    { "_id": "vec86", "chunk_text": "Digital transformation initiatives in 2024 drove productivity gains in the services sector." },
    { "_id": "vec87", "chunk_text": "The U.S. labor market's resilience in 2024 defied predictions of a severe recession." },
    { "_id": "vec88", "chunk_text": "New fiscal measures in the European Union aimed to stabilize debt levels post-pandemic." },
    { "_id": "vec89", "chunk_text": "Venture capital investments in 2024 leaned heavily toward AI and automation startups." },
    { "_id": "vec90", "chunk_text": "The surge in e-commerce in 2024 was facilitated by advancements in logistics technology." },
    { "_id": "vec91", "chunk_text": "The impact of ESG investing on corporate strategies has been a major focus in 2024." },
    { "_id": "vec92", "chunk_text": "Income inequality widened in 2024 despite strong economic growth in developed nations." },
    { "_id": "vec93", "chunk_text": "The collapse of FTX highlighted the volatility and risks associated with cryptocurrencies." },
    { "_id": "vec94", "chunk_text": "Cyberattacks targeting financial institutions in 2024 led to record cybersecurity spending." },
    { "_id": "vec95", "chunk_text": "Latest Data Shows Another Bad News For Apple (AAPL)" },
    { "_id": "vec96", "chunk_text": "New trade agreements signed 2021 will make an impact in 2023"},
]

### Upserting data into the Pinecone index

Here, we embed and upsert the data into Pinecone. During the upsert process, a vector embedding is created for each record using the embedding model we specified on index creation. These vector embeddings are then stored in the index with any additional info, or metadata, we specify. Read more about metadata [here](https://docs.pinecone.io/guides/index-data/indexing-overview#metadata).

We upsert into a namespace called "headlines". A namespace is a higher-level unit of organization within an index, and using one has some important benefits. Querying a namespace acts as a broad filter, because only records that exist in that namespace are considered. This can be used to isolate customer data for multi-tenancy. And when you divide records into namespaces in a logical way, you speed up queries by ensuring only relevant records are scanned. You can learn more about namespaces [here](https://docs.pinecone.io/guides/index-data/indexing-overview#namespaces).
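As a rough illustration of the multi-tenancy idea, records could be grouped by tenant before upserting, with one namespace per group. The `customer` field and the `group_by_namespace` helper below are assumptions made for this sketch; they do not appear in this notebook's data or in the Pinecone SDK.

```python
from collections import defaultdict

def group_by_namespace(records, key="customer"):
    """Group records by a tenant key, dropping that key from each record."""
    groups = defaultdict(list)
    for record in records:
        # Keep the original fields; the tenant is expressed by the namespace instead.
        cleaned = {k: v for k, v in record.items() if k != key}
        groups[record[key]].append(cleaned)
    return dict(groups)

# Hypothetical records tagged with a tenant name.
tenant_records = [
    {"_id": "a1", "chunk_text": "Quarterly earnings summary.", "customer": "acme"},
    {"_id": "b1", "chunk_text": "Bond issuance notice.", "customer": "globex"},
    {"_id": "a2", "chunk_text": "Dividend announcement.", "customer": "acme"},
]

groups = group_by_namespace(tenant_records)
print({ns: [r["_id"] for r in recs] for ns, recs in groups.items()})

# With a live index, each group could go to its own namespace:
# for ns, recs in groups.items():
#     index.upsert_records(namespace=ns, records=recs)
```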

In [102]:
namespace = "headlines"
index.upsert_records(records = data, namespace = namespace)

In [103]:
index.describe_index_stats()

{'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'headlines': {'vector_count': 96}},
 'total_vector_count': 96,
 'vector_type': 'sparse'}

## 3. Running queries


Now that our index is populated we can begin making queries.

Since we're using Pinecone's integrated embedding, we can query our sparse index directly with the text we want to search for. The search query is vectorized using the same embedding model we specified at index creation, and only the most relevant documents (ranked by dot product score) are returned.

In [80]:
search_query = "2023 financial data about Apple"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

print_hits(results)

id: vec1, score: 8.68 text: Apple Inc. issued a $10 billion corporate bond in 2023.
id: vec95, score: 6.08 text: Latest Data Shows Another Bad News For Apple (AAPL)
id: vec34, score: 5.26 text: Apple will spend more than $500 billion in the U.S. over the next four years
id: vec74, score: 4.39 text: Apple prices first bond offering in 2 years
id: vec81, score: 4.21 text: Apple to pay $95 million to settle lawsuit accusing Siri of eavesdropping.
id: vec71, score: 3.2 text: The economic impact of the 2008 financial crisis was mitigated by quantitative easing policies.
id: vec96, score: 3.03 text: New trade agreements signed 2021 will make an impact in 2023
id: vec94, score: 2.81 text: Cyberattacks targeting financial institutions in 2024 led to record cybersecurity spending.
id: vec50, score: 0.51 text: AI-driven automation in the manufacturing sector boosted productivity but raised concerns about job displacement.
id: vec75, score: 0.49 text: The collapse of Silicon Valley Bank raised qu
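The scores above drop sharply once the headlines stop mentioning Apple, so one option is to keep only hits above a cutoff. The sketch below works on a hand-built dictionary shaped like the response that `print_hits` reads, with three hits copied from the output above. The `filter_hits` helper and the cutoff of 1.0 are arbitrary assumptions for illustration, not a recommended setting.

```python
def filter_hits(results, min_score):
    """Return only the hits whose score is at least min_score."""
    return [hit for hit in results["result"]["hits"] if hit["_score"] >= min_score]

# Hand-built stand-in for a search response, using scores from the output above.
mock_results = {
    "result": {
        "hits": [
            {"_id": "vec1", "_score": 8.68,
             "fields": {"chunk_text": "Apple Inc. issued a $10 billion corporate bond in 2023."}},
            {"_id": "vec95", "_score": 6.08,
             "fields": {"chunk_text": "Latest Data Shows Another Bad News For Apple (AAPL)"}},
            {"_id": "vec50", "_score": 0.51,
             "fields": {"chunk_text": "AI-driven automation in the manufacturing sector boosted productivity but raised concerns about job displacement."}},
        ]
    }
}

strong_hits = filter_hits(mock_results, min_score=1.0)
print([hit["_id"] for hit in strong_hits])  # ['vec1', 'vec95']
```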

### Rerank the results

We can use a reranking model to score the results based on their semantic relevance to the query and return a new, more accurate ranking.

In [99]:
results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    },
    rerank={
        "model": "bge-reranker-v2-m3",
        "rank_fields": ["chunk_text"]
    }
)

print_hits(results)

id: vec1, score: 0.8 text: Apple Inc. issued a $10 billion corporate bond in 2023.
id: vec34, score: 0.03 text: Apple will spend more than $500 billion in the U.S. over the next four years
id: vec95, score: 0.02 text: Latest Data Shows Another Bad News For Apple (AAPL)
id: vec96, score: 0.0 text: New trade agreements signed 2021 will make an impact in 2023
id: vec74, score: 0.0 text: Apple prices first bond offering in 2 years
id: vec81, score: 0.0 text: Apple to pay $95 million to settle lawsuit accusing Siri of eavesdropping.
id: vec94, score: 0.0 text: Cyberattacks targeting financial institutions in 2024 led to record cybersecurity spending.
id: vec75, score: 0.0 text: The collapse of Silicon Valley Bank raised questions about regulatory oversight in 2024.
id: vec71, score: 0.0 text: The economic impact of the 2008 financial crisis was mitigated by quantitative easing policies.
id: vec50, score: 0.0 text: AI-driven automation in the manufacturing sector boosted productivity but rai
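To see what reranking changed, the two orderings printed above can be compared directly. The sketch below copies the ids from both outputs and computes how many positions each record moved; a positive number means the reranker moved the record up.

```python
# Ids in the order returned by the lexical search above.
lexical_order = ["vec1", "vec95", "vec34", "vec74", "vec81",
                 "vec71", "vec96", "vec94", "vec50", "vec75"]

# Ids in the order returned after reranking.
reranked_order = ["vec1", "vec34", "vec95", "vec96", "vec74",
                  "vec81", "vec94", "vec75", "vec71", "vec50"]

# Map each id to its zero-based position in the lexical ranking.
lexical_rank = {doc_id: rank for rank, doc_id in enumerate(lexical_order)}

# Positive shift: moved up after reranking; negative shift: moved down.
shifts = {
    doc_id: lexical_rank[doc_id] - new_rank
    for new_rank, doc_id in enumerate(reranked_order)
}

for doc_id, shift in shifts.items():
    print(f"{doc_id}: {shift:+d}")
```

In this run, `vec96` (which mentions 2023) gains the most positions, while `vec71` (the 2008 financial crisis headline) loses the most.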


You may get fewer than `top_k` results if `top_k` is larger than the number of sparse vectors in your index that match your query. That is, any vector whose dot product score with the query is 0 is discarded.

Searching our financial headlines for the word "banana" returns zero results because our data doesn't include the term "banana" at all. For example:

In [84]:
search_query = "banana"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

print_hits(results)

No results found
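The empty result above follows from how the scores are computed. As a rough sketch with made-up token weights (not the model's real output, and not a description of Pinecone's internals), a top-k search that discards zero scores could look like this:

```python
def top_k_nonzero(query, docs, top_k=10):
    """Score docs by sparse dot product and keep at most top_k non-zero hits."""
    scored = []
    for doc_id, vec in docs.items():
        score = sum(weight * vec.get(token, 0.0) for token, weight in query.items())
        # Records that share no tokens with the query score 0 and are dropped.
        if score != 0:
            scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy sparse vectors; the tokens and weights are assumptions for this example.
toy_docs = {
    "vec1": {"apple": 1.2, "bond": 0.8},
    "vec74": {"apple": 1.0, "prices": 0.5},
    "vec5": {"federal": 1.1, "rates": 0.9},
}

banana_hits = top_k_nonzero({"banana": 1.0}, toy_docs)
apple_hits = top_k_nonzero({"apple": 1.0}, toy_docs)
print(banana_hits)  # []
print([doc_id for doc_id, _ in apple_hits])  # ['vec1', 'vec74']
```

Even with `top_k=10`, the "apple" query returns only two hits, because only two of the toy records contain that token.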


## 4. Demo cleanup

When you're done, delete the index to save resources.

Congrats, you've just implemented lexical search with Pinecone!


In [None]:
pc.delete_index(name=index_name)

---