[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/cohere-rerank/compaq-reranking.ipynb)

# Pinecone & Cohere Rerank
This notebook downloads and processes Compaq's 10-K SEC filings, using Pinecone vector search and Cohere Rerank 3.5 to extract targeted insights from the documents. The two-stage retrieval combines efficient vector search with semantic reranking for improved results.

In [1]:
%pip install --quiet --upgrade \
    pinecone-notebooks \
    python-dotenv \
    pinecone \
    pinecone-plugin-records \
    "rich[jupyter]"


In [2]:
# Standard library imports
import os
import re
import time
import getpass
import difflib
from html import unescape

# Third-party imports
import requests
from tqdm import tqdm
from dotenv import load_dotenv
import pandas as pd
from IPython.display import display, HTML
import ipywidgets as widgets

# Pinecone and Rich libraries
from pinecone import Pinecone
from rich import print as rprint
from rich.console import Console
from rich.table import Table
from rich import box

# Data Acquisition
Download Compaq's 10-K filings from the SEC Edgar database.

In [3]:
# The common base URL for SEC EDGAR filings
BASE_URL = "https://www.sec.gov/Archives/edgar/data/714154/"

# Each year's unique URL tail
URL_TAILS = {
    "1994": "0000714154-94-000005.txt",
    "1995": "0000714154-95-000007.txt",
    "1996": "0000714154-96-000006.txt",
    "1997": "0000950129-97-000786.txt",
    "1998": "0001015402-98-000028.txt",
    "1999": "0001015402-99-000214.txt",
    "2000": "000091205700008116/0000912057-00-008116.txt",
    "2001": "000089056601000112/0000890566-01-000112.txt",
    "2002": "000095012902000383/0000950129-02-000383.txt",
}

# Prompt for company name and email, with the email input hidden
user_company = input("Enter your company name: ")
user_email = getpass.getpass("Enter your email address (input hidden): ")
user_agent_declaration = f"{user_company} {user_email}"

headers = {
    "User-Agent": user_agent_declaration,
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.sec.gov"
}

def download_10ks(url_tails: dict, base_url: str, download_dir: str = "data") -> None:
    """
    Download SEC 10-K filings for each specified year.

    Args:
        url_tails (dict): Mapping of year to file tail URL.
        base_url (str): The base URL for filings.
        download_dir (str): Directory to save downloaded files.
    """
    os.makedirs(download_dir, exist_ok=True)

    for year, tail in tqdm(url_tails.items(), desc="Downloading filings", unit="file"):
        url = f"{base_url}{tail}"
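        # A timeout prevents a stalled connection from hanging the loop
        response = requests.get(url, headers=headers, timeout=30)
        # Raise on HTTP errors so an error page is never saved as a filing
        response.raise_for_status()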
        file_path = os.path.join(download_dir, f"{year}.txt")
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        time.sleep(1)


download_10ks(URL_TAILS, BASE_URL)


Enter your company name: Pinecone
Enter your email address (input hidden): ··········


Downloading filings: 100%|██████████| 9/9 [00:10<00:00,  1.15s/file]
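
The download loop above makes a single attempt per filing. As a rough sketch that is not part of the original notebook, a small retry helper with exponential backoff could wrap each request. `fetch_with_retry`, its parameters, and the fake fetcher below are hypothetical names introduced for illustration; the fetch callable is injected so the helper can be tried without network access.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[], str],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it succeeds or retries are exhausted (hypothetical helper)."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:  # in real use, catch requests.RequestException instead
            last_error = exc
            if attempt < retries - 1:
                # Wait base_delay, then 2x, then 4x ... between attempts
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"Giving up after {retries} attempts") from last_error


# Demonstration with a fake fetcher that fails twice, then succeeds
calls = {"count": 0}
delays = []


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<TEXT>filing body</TEXT>"


body = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(body, delays)
```

Inside `download_10ks`, the injected callable could be a small function that performs the `requests.get` call for one URL and returns `response.text`.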


In [4]:
from pathlib import Path

# Read each downloaded filing into memory, keyed by year
texts = {
    year: Path("data", f"{year}.txt").read_text(encoding="utf-8")
    for year in URL_TAILS
}

# Data Preprocessing
Remove unwanted tags, extract tables, and split the data into manageable chunks.

In [5]:
def remove_unwanted_tags(text: str, keep_tags: list = None) -> str:
    """
    Remove all HTML/XML tags from text except for specified tags to keep.

    Args:
        text (str): Input text containing tags.
        keep_tags (list, optional): List of tag names to retain (without angle brackets).
                                    Defaults to ["TABLE"].

    Returns:
        str: Cleaned text with unwanted tags removed.
    """
    if keep_tags is None:
        keep_tags = ["TABLE"]
    keep_pattern = '|'.join(keep_tags)
    pattern = re.compile(rf'<(?!/?(?:{keep_pattern})\b)[^>]+>')
    return pattern.sub('', text)


def parse_10ks(year: str, document_text: str) -> str:
    """
    Extract and clean the text enclosed within <TEXT> tags of a 10-K filing.

    Args:
        year (str): The filing year.
        document_text (str): Raw text of the filing.

    Returns:
        str: Cleaned text within the <TEXT> tags, or an empty string if not found.
    """
    pattern = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
    matches = pattern.findall(document_text)
    if not matches:
        print(f"Document text not found for year {year}")
        return ""
    return remove_unwanted_tags(matches[0])
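
To make the negative lookahead in `remove_unwanted_tags` easier to follow, here is a minimal sketch that applies the same pattern (with `keep_tags` fixed to `["TABLE"]`) to a made-up sample string. The sample text is an assumption for illustration and does not come from a filing.

```python
import re

# Same pattern the notebook builds when keep_tags == ["TABLE"]:
# match any tag unless its name (optionally preceded by "/") is TABLE
pattern = re.compile(r"<(?!/?(?:TABLE)\b)[^>]+>")

sample = "<P>Net sales rose.</P><TABLE><S><C>1995</TABLE>"
cleaned = pattern.sub("", sample)
print(cleaned)

# The pattern is case-sensitive, so lowercase <table> tags are stripped too
lowercase_cleaned = pattern.sub("", "<table>1995</table>")
print(lowercase_cleaned)
```

Because the pattern is compiled without `re.IGNORECASE`, only uppercase `<TABLE>` tags survive. That matches the uppercase tags visible in the outputs shown in this notebook, but it is worth keeping in mind for other documents.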

## Chunking Data by Tables
This section splits the document text into table sections and non-table text.

In [6]:
def chunk_by_table(document_text: str) -> dict:
    """
    Split document text into table and non-table parts.

    Args:
        document_text (str): Raw document text.

    Returns:
        dict: Dictionary with keys 'tables' (list of table sections)
              and 'texts' (concatenated non-table text).
    """
    chunks = {"tables": [], "texts": []}
    table_pattern = re.compile(r'(<TABLE.*?</TABLE>)', re.DOTALL | re.IGNORECASE)
    parts = table_pattern.split(document_text)

    for part in parts:
        if not part.strip():
            continue
        if table_pattern.match(part):
            chunks["tables"].append(part)
        else:
            chunks["texts"].append(part)

    chunks["texts"] = "".join(chunks["texts"])
    return chunks

test_table_chunking = chunk_by_table(texts['1996'])

print(test_table_chunking["tables"][0])

<TABLE> <S> <C>

<ARTICLE> 5
<LEGEND>
THIS SCHEDULE CONTAINS SUMMARY FINANCIAL INFORMATION EXTRACTED FROM
COMPAQ COMPUTER CORPORATION'S CONSOLIDATED BALANCE SHEET AND CONSOLIDATED
STATEMENT OF INCOME FOR THE PERIOD ENDED DECEMBER 31, 1995 AND IS QUAILIFIED
IN ITS ENTIRETY BY REFERENCE TO SUCH FINANCIAL STATEMENTS.
</LEGEND>
<MULTIPLIER> 1,000,000
       
<S>                             <C>
<PERIOD-TYPE>                   YEAR
<FISCAL-YEAR-END>                          DEC-31-1995
<PERIOD-END>                               DEC-31-1995
<CASH>                                             745
<SECURITIES>                                         0
<RECEIVABLES>                                    3,241
<ALLOWANCES>                                       100
<INVENTORY>                                      2,156
<CURRENT-ASSETS>                                 6,527
<PP&E>                                           1,981
<DEPRECIATION>                                     871
<TOTAL-ASSETS>      
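
`chunk_by_table` relies on the fact that `re.split` keeps the delimiters in its output when the pattern contains a capturing group. The following sketch shows that behavior on a small invented string; the variable names are chosen so they do not overwrite the notebook's own `texts` dictionary.

```python
import re

# Same pattern as in chunk_by_table; the outer parentheses form a capturing group
table_pattern = re.compile(r"(<TABLE.*?</TABLE>)", re.DOTALL | re.IGNORECASE)

sample = "Intro text.<TABLE>\n<S> <C>\n1995\n</TABLE>Closing text."
parts = table_pattern.split(sample)
print(parts)

# Separate the captured table sections from the surrounding text
sample_tables = [part for part in parts if table_pattern.match(part)]
sample_text = "".join(
    part for part in parts if part.strip() and not table_pattern.match(part)
)
print(sample_tables)
print(sample_text)
```

Note that joining the non-table parts with an empty string glues the text on either side of a table together without a separator, which is also what the notebook's function does.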

In [7]:
test_table_chunking = chunk_by_table(texts['1996'])

parse_10ks("1996", test_table_chunking["texts"])



In [8]:
# split into table/not table

split_by_table = {
    year: chunk_by_table(text)
    for year, text in texts.items()
}

In [9]:
# clean the non-table bits

clean_texts = {
    year: {"tables": chunk_data["tables"],
           "texts": parse_10ks(year, chunk_data["texts"])}
    for year, chunk_data in split_by_table.items()
}


## Text Chunking

In [10]:
CHAR_CHUNKS = 1500

def chunk_text(text: str, chunk_size: int) -> list:
    """
    Split text into chunks of a specified size.

    Args:
        text (str): The text to be split.
        chunk_size (int): Maximum number of characters per chunk.

    Returns:
        list: A list of text chunks.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# chunk the texts, ignore the tables

chunked_texts = {
    year: {"tables": chunk_data["tables"],
           "texts": chunk_text(chunk_data["texts"], CHAR_CHUNKS)}
    for year, chunk_data in clean_texts.items()
}
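
Fixed-size character chunking is simple, but it cuts wherever the character count lands, including in the middle of a word. The sketch below repeats the notebook's `chunk_text` and contrasts it with an overlapping variant. `chunk_text_with_overlap` is a hypothetical helper that the notebook does not use; it is shown only to illustrate one common way to soften boundary effects.

```python
import math


def chunk_text(text: str, chunk_size: int) -> list:
    """Fixed-size chunking, identical to the notebook's function."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_text_with_overlap(text: str, chunk_size: int, overlap: int) -> list:
    """Hypothetical variant: consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "Net sales increased in 1995."
fixed = chunk_text(sample, 8)
overlapped = chunk_text_with_overlap(sample, 8, 2)

print(fixed)        # words such as "sales" are split across chunks
print(overlapped)   # each chunk repeats the last 2 characters of the previous one
print(len(fixed) == math.ceil(len(sample) / 8))
```

With `CHAR_CHUNKS = 1500` the same effect occurs at a larger scale: a sentence that straddles a boundary ends up in two different records.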


In [11]:
print(chunked_texts['1995']['tables'])

["<TABLE> <S> <C>\n\n<PAGE>\n<ARTICLE> 5\n<LEGEND>\nTHIS SCHEDULE CONTAINS SUMMARY FINANCIAL INFORMATION EXTRACTED FROM\nCOMPAQ COMPUTER CORPORATION'S CONSOLIDATED BALANCE SHEET AND CONSOLIDATED\nSTATEMENT OF INCOME FOR THE PERIOD ENDED DECEMBER 31, 1994 AND IS QUAILIFIED\nIN ITS ENTIRETY BY REFERENCE TO SUCH FINANCIAL STATEMENTS.\n</LEGEND>\n<MULTIPLIER> 1,000,000\n       \n<S>                             <C>\n<PERIOD-TYPE>                   YEAR\n<FISCAL-YEAR-END>                          DEC-31-1994\n<PERIOD-END>                               DEC-31-1994\n<CASH>                                             471\n<SECURITIES>                                         0\n<RECEIVABLES>                                    2,362\n<ALLOWANCES>                                        75\n<INVENTORY>                                      2,005\n<CURRENT-ASSETS>                                 5,158\n<PP&E>                                           1,672\n<DEPRECIATION>                             

# Indexing with Pinecone
## Setting up Pinecone

In [12]:
def is_colab():
    try:
        import google.colab
        return True
    except ImportError:
        return False

# Load API key based on environment
if is_colab():
    from google.colab import userdata
    api_key = userdata.get('PINECONE_API_KEY')
else:
    # Local environment - load from .env or environment variable
    load_dotenv()
    api_key = os.environ.get("PINECONE_API_KEY")

if not api_key:
    raise ValueError("PINECONE_API_KEY not found")

pc = Pinecone(api_key=api_key)
index_name = "reranking-compaq"

In [13]:
# Set up index:
if not pc.has_index(index_name):
    index_model = pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "multilingual-e5-large",
            "field_map": {"text": "chunk_text"}
        }
    )

# Then get the index reference for operations
index = pc.Index(index_name)

## Data Transformation for Indexing

In [14]:
# Target the created index for upsert and search

index = pc.Index(index_name)

# Reformat chunks into records; each ID encodes the document year and chunk number

base_string = "compaq"
# The document year and chunk number are appended to base_string to create a unique ID

all_text_records = []
all_table_records = []

for year, chunks in chunked_texts.items():
    text_records = [
        {
            "chunk_text": chunk,
            "_id": f"{base_string}_{year}#text{num}",
            "filed_year": int(year),
            "chunk_type": "text"
        }
        for num, chunk in enumerate(chunks["texts"])
    ]
    table_records = [
        {
            "chunk_text": chunk,
            "_id": f"{base_string}_{year}#table{num}",
            "filed_year": int(year),
            "chunk_type": "table"
        }
        for num, chunk in enumerate(chunks["tables"])
    ]
    all_text_records.extend(text_records)
    all_table_records.extend(table_records)
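
The IDs built above follow the shape `compaq_<year>#<type><number>`, so the year and chunk type can be recovered from an ID alone. As a small sketch, the helpers below build and parse such an ID. `make_record_id` and `parse_record_id` are hypothetical names and are not part of the notebook.

```python
import re

# Mirrors the f-string used in the notebook: f"{base_string}_{year}#text{num}"
ID_PATTERN = re.compile(
    r"^(?P<base>[^_#]+)_(?P<year>\d{4})#(?P<kind>text|table)(?P<num>\d+)$"
)


def make_record_id(base: str, year: str, kind: str, num: int) -> str:
    """Build an ID in the same shape as the notebook's records."""
    return f"{base}_{year}#{kind}{num}"


def parse_record_id(record_id: str) -> dict:
    """Split an ID back into its components (hypothetical helper)."""
    match = ID_PATTERN.match(record_id)
    if match is None:
        raise ValueError(f"Unrecognized record ID: {record_id!r}")
    parts = match.groupdict()
    return {
        "base": parts["base"],
        "year": int(parts["year"]),
        "kind": parts["kind"],
        "num": int(parts["num"]),
    }


example_id = make_record_id("compaq", "1996", "text", 12)
parsed_id = parse_record_id(example_id)
print(example_id, parsed_id)
```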

## Upserting Data to Pinecone

In [15]:
# we need to upsert in batches of 96 due to limitations of the integrated inference endpoint

def upsert_in_batches(records, batch_size=96):
    for i in range(0, len(records), batch_size):
        batch = records[i:i+batch_size]
        # Extract the record IDs for this batch
        record_ids = [record["_id"] for record in batch]

        # Fetch existing records from the index for these IDs.
        # (Assumes that the returned dict contains a "vectors" key with the current records.)
        existing = index.fetch(ids=record_ids, namespace=base_string)
        existing_ids = set(existing.get("vectors", {}).keys())

        # Filter out records that already exist.
        records_to_upsert = [record for record in batch if record["_id"] not in existing_ids]

        if records_to_upsert:
            index.upsert_records(namespace=base_string, records=records_to_upsert)
            time.sleep(1)

upsert_in_batches(all_text_records)

# Tables exceed size limits.
# upsert_in_batches(all_table_records)
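
The batching and de-duplication logic in `upsert_in_batches` can be separated from the Pinecone calls and checked on its own. The sketch below does that with two hypothetical helpers, `iter_batches` and `drop_existing`, and synthetic records; no index is contacted.

```python
from typing import Iterator


def iter_batches(records: list, batch_size: int = 96) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def drop_existing(batch: list, existing_ids: set) -> list:
    """Keep only records whose _id is not already present in the index."""
    return [record for record in batch if record["_id"] not in existing_ids]


# Synthetic records in the same shape the notebook uses
sample_records = [
    {"_id": f"compaq_1996#text{n}", "chunk_text": "placeholder"} for n in range(200)
]

batch_sizes = [len(batch) for batch in iter_batches(sample_records)]
first_batch = next(iter_batches(sample_records))
remaining = drop_existing(first_batch, {"compaq_1996#text0", "compaq_1996#text1"})

print(batch_sizes)      # the final batch holds whatever is left over
print(len(remaining))
```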

# Querying the Index

In [16]:
ranked_results = index.search_records(
    namespace="compaq",
    query={
        "inputs": {"text": "risks to compaq for dollar hedging"},
        "top_k": 20
    },
    rerank={
        "model": "cohere-rerank-3.5",
        "top_n": 3,
        "rank_fields": ["chunk_text"]
    },
    fields=["chunk_text"]
)

hits = ranked_results["result"]["hits"]

for hit in hits:
    chunk = hit['fields']['chunk_text']
    chunk = unescape(chunk)  # Converts &nbsp; to actual non-breaking spaces
    rprint(chunk)

In [17]:
def query_db(query: str, embed_with_rerank: bool = False, filter: dict = None) -> list:
    """
    Query the Pinecone index with an optional rerank.

    Args:
        query (str): The query string.
        embed_with_rerank (bool, optional): Whether to apply reranking. Defaults to False.
        filter (dict, optional): Filter criteria for the query. Defaults to None.

    Returns:
        list: A list of hit records from the index.
    """
    top_k = 20 if embed_with_rerank else 3
    q = {"inputs": {"text": query}, "top_k": top_k}
    if filter is not None:
        q["filter"] = filter

    if embed_with_rerank:
        ranked_results = index.search_records(
            namespace="compaq",
            query=q,
            rerank={
                "model": "cohere-rerank-3.5",
                "top_n": 3,
                "rank_fields": ["chunk_text"]
            },
            fields=["chunk_text"]
        )
    else:
        ranked_results = index.search_records(
            namespace="compaq",
            query=q,
            fields=["chunk_text"]
        )
    return ranked_results["result"]["hits"]
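
The two branches of `query_db` differ only in `top_k` and in whether a `rerank` block is sent. One way to make that explicit is to build the keyword arguments in a single place. The sketch below is an assumption-laden refactoring, not the notebook's code: `build_search_kwargs` and its parameter names are hypothetical, and it reuses only the argument names and values that already appear in the calls above.

```python
from typing import Optional


def build_search_kwargs(
    query: str,
    rerank: bool = False,
    metadata_filter: Optional[dict] = None,
    final_n: int = 3,
    candidates: int = 20,
) -> dict:
    """Assemble keyword arguments in the shape the notebook passes to search_records."""
    # Stage 1: retrieve a wider candidate pool only when a reranker will trim it
    top_k = candidates if rerank else final_n
    search_query = {"inputs": {"text": query}, "top_k": top_k}
    if metadata_filter is not None:
        search_query["filter"] = metadata_filter

    kwargs = {"namespace": "compaq", "query": search_query, "fields": ["chunk_text"]}
    if rerank:
        # Stage 2: rerank the candidates and keep the best final_n
        kwargs["rerank"] = {
            "model": "cohere-rerank-3.5",
            "top_n": final_n,
            "rank_fields": ["chunk_text"],
        }
    return kwargs


plain_kwargs = build_search_kwargs("dollar hedging risks")
rerank_kwargs = build_search_kwargs(
    "dollar hedging risks", rerank=True, metadata_filter={"chunk_type": "text"}
)
print(plain_kwargs)
print(rerank_kwargs)
```

With an index available, the result could be passed along as `index.search_records(**rerank_kwargs)`, which corresponds to the reranked call in `query_db`.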


## Formatting query results with Rich

We can use Rich to display query results side by side, with and without reranking.

In [27]:
def query_with_formatting(query: str) -> Table:
    """
    Execute a query and format the results into a Rich Table.

    Args:
        query (str): The query string.

    Returns:
        Table: A Rich Table object with side-by-side results.
    """
    without_hits = query_db(query, embed_with_rerank=False, filter={"chunk_type": "text"})
    with_hits = query_db(query, embed_with_rerank=True, filter={"chunk_type": "text"})

    without_texts = [hit['fields']['chunk_text'] for hit in without_hits]
    with_texts = [hit['fields']['chunk_text'] for hit in with_hits]

    table = Table(title=f"Results with query: {query}", show_lines=True)
    table.add_column("Result #", justify="left", style="black", no_wrap=True, width=75)
    table.add_column("Without Rerank", justify="left", style="black", width=500)
    table.add_column("With Rerank", style="blue", justify="left", width=500)

    for i, (no_rerank, with_rerank_text) in enumerate(zip(without_texts, with_texts), start=1):
        table.add_row(str(i), unescape(no_rerank), unescape(with_rerank_text))

    return table
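
`query_with_formatting` pairs the two result lists with `zip`, which stops at the shorter list. If one search ever returns fewer hits than the other, the extra rows are silently dropped. The sketch below shows the difference using placeholder strings and `itertools.zip_longest`; the list names and the fill value are illustrative assumptions.

```python
from itertools import zip_longest

# Placeholder results: the reranked search returned one hit fewer
baseline_texts = ["chunk A", "chunk B", "chunk C"]
reranked_texts = ["chunk X", "chunk Y"]

# zip truncates to the shorter list
rows_zip = [
    (i, left, right)
    for i, (left, right) in enumerate(zip(baseline_texts, reranked_texts), start=1)
]

# zip_longest keeps every row and pads the missing side
rows_longest = [
    (i, left, right)
    for i, (left, right) in enumerate(
        zip_longest(baseline_texts, reranked_texts, fillvalue="(no result)"), start=1
    )
]

print(rows_zip)
print(rows_longest)
```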

## Simple Query


This query pulls information about what Compaq sells and how its products relate to sales.

In the returned results, the Rerank endpoint returns chunks that focus on changes in the product offerings but not on sales, whereas the search without reranking returns sales numbers. Let's see whether a more specific query improves this.

In [28]:
QUERY = "Changes in Compaq's product offerings and their impacts on sales"
result = query_with_formatting(QUERY)
console = Console()
console.print(result)

## Numerical Reasoning Capability

Let's take our query and add a requirement on sales numbers. Typical reranking models (and embedding models) tend to struggle with this level of granularity, so let's see how the Rerank endpoint performs.


In [29]:
QUERY = "Changes in Compaq's product offerings and their impacts on sales"
numerical = " and show changes greater than 30% in sales"
result = query_with_formatting(QUERY + numerical)
console = Console()
console.print(result)

We can see that the rerank endpoint returns chunks that analyze sales changes in excess of 30%, whereas without reranking we don't get these figures!
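
Reading the table by eye works for three results, but a rough automated check can help when comparing many queries. The sketch below is a heuristic that is not part of the notebook: it extracts percentages from a chunk and flags chunks that mention any value above a threshold. The sample sentences are invented for illustration and are not quotes from the filings, and the regular expression matches any percentage, not only sales changes.

```python
import re

# Matches numbers followed by "%" or the word "percent"
PERCENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:%|percent\b)", re.IGNORECASE)


def percentages_in(text: str) -> list:
    """Return every percentage value mentioned in the text as a float."""
    return [float(value) for value in PERCENT_PATTERN.findall(text)]


def mentions_change_above(text: str, threshold: float) -> bool:
    """True if the text mentions at least one percentage above the threshold."""
    return any(value > threshold for value in percentages_in(text))


# Invented example sentences, not taken from the 10-K filings
sample_chunks = [
    "Sales in one region increased 37% compared with the prior year.",
    "Sales in another region increased 12 percent.",
]
flags = [mentions_change_above(chunk, 30) for chunk in sample_chunks]
print(flags)
```

Applied to the `chunk_text` field of each hit, a check like this gives a quick count of how many returned chunks contain a figure above 30%.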

## Negation Capabilities

The Rerank endpoint is also excellent at taking exclusion requirements in a query into account. This is not a typical capability for LLMs, and is often cited as a hindrance for reasoning tasks.

Let's try to expose this reasoning by asking about the acquisitions in Compaq's history:

In [30]:
base_query = "largest acquisitions in compaq history"
result = query_with_formatting(base_query)
console = Console()
console.print(result)

In the above chunks, we see that Compaq acquired many firms, the largest being Digital Equipment Corporation. Let's see if we can retrieve analysis from the 10-Ks that excludes this company.


In [31]:
negation = " that is not Digital Equipment Corporation"
result = query_with_formatting(base_query + negation)
console = Console()
console.print(result)

Interestingly, with the reranker the returned chunks exclude the purchase of Digital in 1998, whereas without the reranker the results still reference Digital Equipment Corporation.

Although this behavior may depend on the chunking strategy and on the content of the underlying text, we can still demonstrate it cleanly without knowing exactly which chunks contain which information.
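
To put a number on the comparison, one could measure what share of the returned hits still mention the excluded term. The sketch below assumes hits shaped like the ones the notebook reads (`hit['fields']['chunk_text']`). `mention_rate` is a hypothetical helper, and the sample hits contain placeholder text rather than real filing content.

```python
def mention_rate(hits: list, term: str) -> float:
    """Fraction of hits whose chunk_text contains the term (case-insensitive)."""
    if not hits:
        return 0.0
    term_lower = term.lower()
    mentioning = sum(
        1 for hit in hits if term_lower in hit["fields"]["chunk_text"].lower()
    )
    return mentioning / len(hits)


# Placeholder hits in the same shape as the search results
sample_hits = [
    {"fields": {"chunk_text": "Placeholder chunk mentioning Digital Equipment Corporation."}},
    {"fields": {"chunk_text": "Placeholder chunk about a different acquisition."}},
]
rate = mention_rate(sample_hits, "Digital Equipment")
print(rate)
```

Computing this rate for the hits returned by `query_db` with and without reranking would turn the visual comparison into a single figure per query.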

## Temporal Reasoning

The Rerank endpoint is great at using times, dates, and other forms of temporal information in refining results.

Here, the query asks about products sold from 1999 onward, even though we do not explicitly filter on the filing year in the metadata!

In [32]:
QUERY = "Consumer products that compaq started selling in 1999 and beyond"
result = query_with_formatting(QUERY)
console = Console()
console.print(result)
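
For comparison, the same time constraint could be expressed as an explicit metadata filter on the `filed_year` field stored with each record. The sketch below only builds the filter dictionary using Pinecone's `$and`, `$eq`, and `$gte` operators. `year_filter` is a hypothetical helper, and passing its result through `query_db`'s `filter` argument is an assumption based on how the notebook already passes `{"chunk_type": "text"}`.

```python
def year_filter(min_year: int, chunk_type: str = "text") -> dict:
    """Build a metadata filter for records of one chunk type filed in or after min_year."""
    return {
        "$and": [
            {"chunk_type": {"$eq": chunk_type}},
            {"filed_year": {"$gte": min_year}},
        ]
    }


explicit_filter = year_filter(1999)
print(explicit_filter)
```

A call such as `query_db(QUERY, embed_with_rerank=True, filter=year_filter(1999))` would then restrict candidates before reranking. Keep in mind that `filed_year` is the filing year: the table output earlier in this notebook shows the filing keyed as 1996 reporting on the fiscal year ended December 31, 1995, so a filter on filing year is not identical to a filter on the period discussed in the text.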