## hybrid_search_pinecone.ipynb

Jay Urbain, PhD

6/4/2024, 2/20/2025

References:   
https://www.pinecone.io/learn/hybrid-search-intro/    


In [1]:
# !pip install datasets 
# !pip install transformers  
# !pip install pinecone 
# !pip install pinecone-text
# !pip install sentence_transformers

In [21]:
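import os
from getpass import getpass

# read the key from the environment, or prompt for it if it is not set
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY") or getpass("Pinecone API key: ")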

Load the HuggingFace [pubmed_qa](https://huggingface.co/datasets/pubmed_qa) dataset.

Each record is returned in dictionary format.

In [2]:
from datasets import load_dataset  # !pip install datasets
pubmed = load_dataset(
   'pubmed_qa',
   'pqa_labeled',
   split='train'
)
pubmed

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 1000
})

The context feature is what we will store in the vector database. 

Each context record contains multiple contexts within a list. 

Join contexts within a record to create larger, more meaningful contexts.

Note: The language is highly technical and domain-specific, which makes this dataset a good candidate for hybrid search.

In [3]:
contexts = []
# loop through the context passages
for record in pubmed['context']:
   # join context passages for each question and append to contexts list
   contexts.append('\n'.join(record['contexts']))
# view some of the contexts
for context in contexts[:2]:
   print(f"{context[:300]}...")

Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cel...
Assessment of visual acuity depends on the optotypes used for measurement. The ability to recognize different optotypes differs even if their critical details appear under the same visual angle. Since optotypes are evaluated on individuals with good visual acuity and without eye disorders, differenc...


### Sparse Vectors

Use the BERT tokenizer for sparse embeddings.

Note: Several methods exist for building sparse vector embeddings, from recent sparse embedding transformer models like SPLADE to rule-based tokenization logic.

In [4]:
from transformers import BertTokenizerFast  # !pip install transformers

# load bert tokenizer from huggingface
tokenizer = BertTokenizerFast.from_pretrained(
   'bert-base-uncased'
)

In [5]:
type( contexts )

list

Try tokenizing a single context.

The output from this includes a few arrays.


In [6]:
# tokenize the context passage
inputs = tokenizer(
   contexts[0], padding=True, truncation=True,
   max_length=512
)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [7]:
#inputs

`input_ids` holds one integer ID per word or sub-word token.

In [8]:
input_ids = inputs['input_ids']
input_ids

[101,
 16984,
 3526,
 2331,
 1006,
 7473,
 2094,
 1007,
 2003,
 1996,
 12222,
 2331,
 1997,
 4442,
 2306,
 2019,
 15923,
 1012,
 1996,
 12922,
 3269,
 1006,
 9706,
 17175,
 18150,
 2239,
 11934,
 27806,
 1007,
 7137,
 2566,
 29278,
 10708,
 1999,
 2049,
 3727,
 2083,
 7473,
 2094,
 1012,
 1996,
 3727,
 1997,
 1996,
 3269,
 8676,
 1997,
 1037,
 17779,
 6198,
 1997,
 20134,
 1998,
 18323,
 9607,
 4372,
 20464,
 18606,
 2024,
 29111,
 1012,
 7473,
 2094,
 5158,
 1999,
 1996,
 4442,
 2012,
 1996,
 2415,
 1997,
 2122,
 2024,
 29111,
 1998,
 22901,
 15436,
 2015,
 1010,
 7458,
 3155,
 2274,
 4442,
 2013,
 1996,
 12436,
 28817,
 20051,
 5397,
 1012,
 1996,
 2535,
 1997,
 10210,
 11663,
 15422,
 4360,
 2076,
 7473,
 2094,
 2038,
 2042,
 3858,
 1999,
 4176,
 1025,
 2174,
 1010,
 2009,
 2038,
 2042,
 2625,
 3273,
 2076,
 7473,
 2094,
 1999,
 4264,
 1012,
 1996,
 2206,
 3259,
 3449,
 14194,
 8524,
 4570,
 1996,
 2535,
 1997,
 23079,
 10949,
 2076,
 13908,
 2135,
 12222,
 7473,
 2094,
 1999,
 2426

Each input ID stands for one word or sub-word token. The translation from text to IDs is done by the BERT tokenizer’s rule-based tokenization logic.
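To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the style of rule WordPiece tokenizers apply to each word. The vocabulary below is a made-up toy set, and `wordpiece` is a hypothetical helper, not the tokenizer's actual implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# toy vocabulary (assumption, for illustration only)
toy_vocab = {"pc", "##d", "cell", "death", "mito", "##chond", "##ria"}

print(wordpiece("pcd", toy_vocab))           # ['pc', '##d']
print(wordpiece("mitochondria", toy_vocab))  # ['mito', '##chond', '##ria']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

This is consistent with the output above, where the ID pair 7473, 2094 recurs wherever the text contains "PCD".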

Pinecone expects to receive sparse vectors in dictionary format, storing only the entries that are present. For example, the list of token IDs:

[0, 2, 9, 2, 5, 5]

Would become:

{ "0": 1, "2": 2, "5": 2, "9": 1 }

Each token is represented by a single key in the dictionary, and the value stored under that key is the number of times the token occurs.

This is similar in style to an inverted index.

The `Counter` class from `collections` handles this nicely for us.
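As a quick check of the toy example above, the following sketch counts the six IDs with `Counter` and prints the result sorted by key. The variable names are illustrative only.

```python
from collections import Counter

token_ids = [0, 2, 9, 2, 5, 5]

# Counter maps each distinct ID to the number of times it occurs
counts = dict(Counter(token_ids))

# sort by key only to match the order shown in the text
print(dict(sorted(counts.items())))  # {0: 1, 2: 2, 5: 2, 9: 1}
```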

In [9]:
from collections import Counter

# convert the input_ids list to a dictionary of key to frequency values
sparse_vec = dict(Counter(input_ids))
sparse_vec

{101: 1,
 16984: 1,
 3526: 2,
 2331: 2,
 1006: 10,
 7473: 13,
 2094: 13,
 1007: 10,
 2003: 2,
 1996: 13,
 12222: 2,
 1997: 13,
 4442: 7,
 2306: 2,
 2019: 1,
 15923: 1,
 1012: 14,
 12922: 2,
 3269: 3,
 9706: 1,
 17175: 1,
 18150: 1,
 2239: 1,
 11934: 2,
 27806: 2,
 7137: 1,
 2566: 3,
 29278: 2,
 10708: 2,
 1999: 11,
 2049: 1,
 3727: 4,
 2083: 1,
 8676: 1,
 1037: 8,
 17779: 1,
 6198: 1,
 20134: 1,
 1998: 7,
 18323: 1,
 9607: 1,
 4372: 1,
 20464: 2,
 18606: 1,
 2024: 3,
 29111: 2,
 5158: 1,
 2012: 1,
 2415: 1,
 2122: 2,
 22901: 1,
 15436: 1,
 2015: 1,
 1010: 7,
 7458: 1,
 3155: 1,
 2274: 1,
 2013: 1,
 12436: 1,
 28817: 1,
 20051: 1,
 5397: 1,
 2535: 2,
 10210: 2,
 11663: 1,
 15422: 1,
 4360: 1,
 2076: 4,
 2038: 2,
 2042: 2,
 3858: 1,
 4176: 1,
 1025: 2,
 2174: 1,
 2009: 1,
 2625: 1,
 3273: 1,
 4264: 1,
 2206: 1,
 3259: 1,
 3449: 1,
 14194: 1,
 8524: 1,
 4570: 1,
 23079: 6,
 10949: 3,
 13908: 1,
 2135: 1,
 24269: 2,
 2309: 1,
 9890: 1,
 3332: 2,
 2754: 2,
 7053: 1,
 10066: 1,
 2001: 2,
 40

Functions:  
`build_dict` transforms input IDs into dictionaries.  
`generate_sparse_vectors` handles the tokenization and dictionary creation.

In [10]:
def build_dict(input_batch):
   # store a batch of sparse embeddings
   sparse_emb = []
   # iterate through input batch
   for token_ids in input_batch:
       indices = []
       values = []
       # convert the input_ids list to a dictionary of key to frequency values
       d = dict(Counter(token_ids))
       for idx in d:
            indices.append(idx)
            values.append(d[idx])
       sparse_emb.append({'indices': indices, 'values': values})
   # return sparse_emb list
   return sparse_emb


def generate_sparse_vectors(context_batch):
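   # create batch of input_ids; padding is off so [PAD] tokens are not
   # counted as terms, and [CLS]/[SEP] special tokens are left out
   inputs = tokenizer(
           context_batch, padding=False,
           truncation=True,
           max_length=512, add_special_tokens=False
   )['input_ids']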
   # create sparse dictionaries
   sparse_embeds = build_dict(inputs)
   return sparse_embeds
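One thing to watch when counting token IDs per batch: if sequences are padded to a common length, the pad ID would be counted as if it were a term. The sketch below filters padding and special tokens before counting. `to_sparse` is a hypothetical helper, and the IDs 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` IDs of `bert-base-uncased`.

```python
from collections import Counter

PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased special token IDs


def to_sparse(token_ids, skip=(PAD_ID, CLS_ID, SEP_ID)):
    """Count token IDs, ignoring padding and special tokens."""
    counts = Counter(t for t in token_ids if t not in skip)
    return {
        "indices": list(counts),
        "values": [float(c) for c in counts.values()],
    }


# toy batch: the second sequence was padded with two zeros
padded_batch = [
    [101, 7473, 2094, 7473, 102],
    [101, 3526, 102, 0, 0],
]
sparse_batch = [to_sparse(ids) for ids in padded_batch]
print(sparse_batch)
# [{'indices': [7473, 2094], 'values': [2.0, 1.0]}, {'indices': [3526], 'values': [1.0]}]
```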

### Dense vectors

Generates SentenceTransformer dense vectors of length 384.

The model gives us a 384 dimensional dense vector. We can move on to upserting the full dataset with both sparse and dense vectors.
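The index created later in this notebook uses the dot product metric. Assuming the model's embeddings are unit length (the `-cos-` in the model name hints at this, but it is worth checking, for example with `numpy.linalg.norm(emb)`), dot product and cosine similarity are equal. A small pure-Python sketch with made-up vectors:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# toy vectors (assumption), normalized to length 1
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# for unit vectors the two similarity measures agree
print(math.isclose(dot(a, b), cosine(a, b)))  # True
```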

In [11]:
# !pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# load a sentence transformer model from huggingface
model = SentenceTransformer(
   'multi-qa-MiniLM-L6-cos-v1'
)

emb = model.encode(contexts[0])
emb.shape

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


(384,)

In [12]:
contexts[0]

'Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.\nThe following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondrial dye MitoTracker Red CMXRos 

In [13]:
emb

array([-2.68844105e-02, -3.93753313e-02,  5.83890267e-02, -7.56208673e-02,
        7.43441507e-02, -3.60408761e-02, -4.25832830e-02,  7.23779500e-02,
        4.80940081e-02,  5.64286858e-02,  8.95968527e-02,  5.98005429e-02,
       -6.41748682e-02, -3.01561039e-02, -1.51890174e-01, -2.67446204e-03,
       -2.43532285e-02,  2.98274867e-02, -1.97161641e-02,  8.44248086e-02,
        2.94119474e-02,  1.97721012e-02,  5.52950129e-02, -4.99724187e-02,
       -1.67490747e-02,  4.65615802e-02, -1.15130581e-02, -9.48361084e-02,
        1.28917480e-02, -1.90359261e-02, -3.58578637e-02,  7.32988119e-02,
       -5.13357706e-02,  5.64932153e-02,  4.69889939e-02,  5.58662899e-02,
       -1.76586974e-02, -1.17624383e-02, -2.25632992e-02,  5.57572814e-03,
        3.22723687e-02, -2.58667264e-02, -5.76049127e-02,  6.59970716e-02,
       -1.14573501e-02, -1.58389527e-02, -2.36234497e-02, -5.07370420e-02,
       -1.20874919e-01, -1.60646066e-02,  6.38828264e-04, -7.89955333e-02,
       -1.48448236e-02,  

### Pinecone sparse-dense index

In [15]:
from pinecone import Pinecone
pc = Pinecone(
   api_key=PINECONE_API_KEY,  # app.pinecone.io
)
pc.list_indexes().names()  # check whether the index already exists

[]

The process for creating and using a sparse-dense index is almost the same as for a pure dense index. The only change is that upserts must include an additional `sparse_values` field and queries must include an additional `sparse_vector` parameter.
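For orientation, here is a sketch of what a single sparse-dense record looks like, using the same field names as the upsert code in this notebook. `check_hybrid_record` is a hypothetical sanity check, not part of the Pinecone client, and the record uses a toy dimension of 4 instead of 384.

```python
def check_hybrid_record(record, dimension):
    """Hypothetical check that a record has a consistent dense and sparse part."""
    if len(record["values"]) != dimension:
        raise ValueError("dense vector has the wrong dimension")
    sparse = record["sparse_values"]
    if len(sparse["indices"]) != len(sparse["values"]):
        raise ValueError("sparse indices and values must have equal length")
    if len(set(sparse["indices"])) != len(sparse["indices"]):
        raise ValueError("sparse indices must be unique")
    return True


# toy record (assumption): dimension 4 instead of 384
record = {
    "id": "0",
    "values": [0.1, 0.2, 0.3, 0.4],
    "sparse_values": {"indices": [7473, 2094], "values": [2.0, 1.0]},
    "metadata": {"context": "Programmed cell death (PCD) ..."},
}
print(check_hybrid_record(record, dimension=4))  # True
```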

Initialize the sparse-dense enabled index.

In [16]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

In [17]:
index_name = "hybrid-pubmed"
# pc.delete_index(index_name)

In [18]:
from pinecone import Pinecone
import time

# pc = Pinecone()

# index_name = "hybrid-pubmed"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

print( existing_indexes )

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of multi-qa-MiniLM-L6-cos-v1
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()


[]


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [19]:
# help(Pinecone.create_index)

Each vector needs a sparse values field when indexing.

A sparse-dense enabled index must use the dotproduct metric. The requirement to set pod_type to s1 or p1 applies to pod-based indexes; this notebook uses a serverless index instead.

With all of that ready, we can begin adding all of our data to the hybrid index like so:

In [20]:
from tqdm.auto import tqdm
from pinecone_text.sparse import BM25Encoder

index = pc.Index(index_name)

# Initialize BM25 and fit the corpus.
bm25 = BM25Encoder().default()
bm25.fit(contexts)

batch_size = 32

for i in tqdm(range(0, len(contexts), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(contexts))
    # extract batch
    context_batch = contexts[i:i_end]
    # print( context_batch )
    # create unique IDs
    ids = [str(x) for x in range(i, i_end)]
    # add context passages as metadata
    meta = [{'context': context} for context in context_batch]
    # create dense vectors
    dense_embeds = model.encode(context_batch).tolist()
    # create sparse vectors
    # sparse_embeds = generate_sparse_vectors(context_batch)
    sparse_embeds = bm25.encode_documents(context_batch)

    vectors = []
    # loop through the data and create dictionaries for upserts
    for _id, sparse, dense, metadata in zip(
        ids, sparse_embeds, dense_embeds, meta
    ):
        vectors.append({
            'id': _id,
            'sparse_values': sparse,
            'values': dense,
            'metadata': metadata
        })

    # upload the documents to the new hybrid index
    index.upsert(vectors)

# show index description after uploading the documents
# pinecone.describe_index_stats()
print(index.describe_index_stats())


100% [........................................................................] 65406227 / 65406227

  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 992}},
 'total_vector_count': 992}


In [40]:
print(index.describe_index_stats())

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}
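The first stats call reports 992 vectors and the later one reports 1000. One possible reading (an assumption, not something the output confirms) is that the stats had not yet caught up with the final upsert: 992 is exactly 31 full batches of 32. The batching arithmetic can be checked with a short sketch, where `batch_ranges` is a hypothetical helper mirroring the loop above:

```python
def batch_ranges(n_items, batch_size):
    """Return (start, end) pairs, same slicing as the upsert loop."""
    return [
        (i, min(i + batch_size, n_items))
        for i in range(0, n_items, batch_size)
    ]


ranges = batch_ranges(1000, 32)
print(len(ranges))   # 32
print(ranges[-1])    # (992, 1000)

# vectors covered by every batch except the last
print(sum(end - start for start, end in ranges[:-1]))  # 992
```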


### Queries

Queries must include dense and sparse vectors.

The `alpha` parameter weights the two: `alpha = 1` is pure dense search, `alpha = 0` is pure sparse search, and values in between blend them.
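To see why scaling the query vectors works as a weighting, here is a small sketch under the assumption that the index scores a match as the dense dot product plus the sparse dot product. All function names and toy vectors are hypothetical.

```python
def dense_dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(q, d):
    """Dot product of two sparse vectors in indices/values form."""
    d_map = dict(zip(d["indices"], d["values"]))
    return sum(v * d_map.get(i, 0.0) for i, v in zip(q["indices"], q["values"]))


def hybrid_score(dense_q, sparse_q, dense_d, sparse_d, alpha):
    # scale the query side only, as in a convex combination
    scaled_dense = [v * alpha for v in dense_q]
    scaled_sparse = {
        "indices": sparse_q["indices"],
        "values": [v * (1 - alpha) for v in sparse_q["values"]],
    }
    return dense_dot(scaled_dense, dense_d) + sparse_dot(scaled_sparse, sparse_d)


# toy query and document (assumptions)
dense_q, dense_d = [1.0, 0.0], [0.5, 0.5]
sparse_q = {"indices": [10, 20], "values": [1.0, 1.0]}
sparse_d = {"indices": [20, 30], "values": [2.0, 4.0]}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_score(dense_q, sparse_q, dense_d, sparse_d, alpha))
# 0.0 2.0   (sparse score only)
# 0.5 1.25  (half of each)
# 1.0 0.5   (dense score only)
```

Because the dot product is linear, scaling the query vectors by `alpha` and `1 - alpha` is equivalent to weighting the two scores themselves.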

In [41]:
def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse


def hybrid_query(question, top_k, alpha):
   # convert the question into a sparse vector using the
   # query-side BM25 encoder
   # sparse_vec = generate_sparse_vectors([question])[0]
   sparse_vec = bm25.encode_queries(question)
   # print(sparse_vec)
   # convert the question into a dense vector (plain list of floats)
   dense_vec = model.encode([question])[0].tolist()
   # print(dense_vec)

   # scale alpha with hybrid_scale
   # print( type(sparse_vec[0]['indices']), type(dense_vec) )
   dense_vec, sparse_vec = hybrid_scale(
      dense_vec, sparse_vec, alpha
   )
   # query pinecone with the query parameters
   result = index.query(
      vector=dense_vec,
      sparse_vector=sparse_vec,
      top_k=top_k,
      include_metadata=True
   )
   # return search results as json
   return result

In [42]:
question = "Can clinicians use the PHQ-9 to assess depression in people with vision loss?"

In [43]:
hybrid_query(question, top_k=3, alpha=1.0)

{'matches': [{'id': '305',
              'metadata': {'context': 'The gap between evidence-based '
                                      'treatments and routine care has been '
                                      'well established. Findings from the '
                                      'Sequenced Treatments Alternatives to '
                                      'Relieve Depression (STAR*D) emphasized '
                                      'the importance of measurement-based '
                                      'care for the treatment of depression as '
                                      'a key ingredient for achieving response '
                                      'and remission; yet measurement-based '
                                      'care approaches are not commonly used '
                                      'in clinical practice.\n'
                                      'The Nine-Item Patient Health '
                                      'Questionnaire (PHQ-

In [44]:
hybrid_query(question, top_k=3, alpha=0.5)

{'matches': [{'id': '711',
              'metadata': {'context': 'To investigate whether the Patient '
                                      'Health Questionnaire-9 (PHQ-9) '
                                      'possesses the essential psychometric '
                                      'characteristics to measure depressive '
                                      'symptoms in people with visual '
                                      'impairment.\n'
                                      'The PHQ-9 scale was completed by 103 '
                                      'participants with low vision. These '
                                      'data were then assessed for fit to the '
                                      'Rasch model.\n'
                                      "The participants' mean +/- standard "
                                      'deviation (SD) age was 74.7 +/- 12.2 '
                                      'years. Almost one half of them (n = 46; '
                

In [46]:
hybrid_query(question, top_k=3, alpha=0.0)

{'matches': [{'id': '711',
              'metadata': {'context': 'To investigate whether the Patient '
                                      'Health Questionnaire-9 (PHQ-9) '
                                      'possesses the essential psychometric '
                                      'characteristics to measure depressive '
                                      'symptoms in people with visual '
                                      'impairment.\n'
                                      'The PHQ-9 scale was completed by 103 '
                                      'participants with low vision. These '
                                      'data were then assessed for fit to the '
                                      'Rasch model.\n'
                                      "The participants' mean +/- standard "
                                      'deviation (SD) age was 74.7 +/- 12.2 '
                                      'years. Almost one half of them (n = 46; '
                

In [47]:
question = "Is mad cow disease related to the prion protein?"
hybrid_query(question, top_k=3, alpha=1.0)

{'matches': [{'id': '637',
              'metadata': {'context': 'Although the mechanism of muscle '
                                      'wasting in end-stage renal disease is '
                                      'not fully understood, there is '
                                      'increasing evidence that acidosis '
                                      'induces muscle protein degradation and '
                                      'could therefore contribute to the loss '
                                      'of muscle protein stores of patients on '
                                      'hemodialysis, a prototypical state of '
                                      'chronic metabolic acidosis (CMA). '
                                      'Because body protein mass is controlled '
                                      'by the balance between synthesis and '
                                      'degradation, protein loss can occur as '
                                      '

In [51]:
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

print( existing_indexes )

[]


In [50]:
pc.delete_index('hybrid-pubmed')


In [None]:
pubmed

In [None]:
help(pubmed)

In [52]:
from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list)
print(metrics_list)



['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score', 'cer', 'character', 'charcut_mt', 'chrf', 'code_eval', 'comet', 'competition_math', 'confusion_matrix', 'coval', 'cuad', 'exact_match', 'f1', 'frugalscore', 'glue', 'google_bleu', 'indic_glue', 'mae', 'mahalanobis', 'mape', 'mase', 'matthews_correlation', 'mauve', 'mean_iou', 'meteor', 'mse', 'nist_mt', 'pearsonr', 'perplexity', 'poseval', 'precision', 'r_squared', 'recall', 'rl_reliability', 'roc_auc', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'smape', 'spearmanr', 'squad', 'squad_v2', 'super_glue', 'ter', 'trec_eval', 'wer', 'wiki_split', 'xnli', 'xtreme_s', 'Aledade/extraction_evaluation', 'AlhitawiMohammed22/CER_Hu-Evaluation-Metrics', 'Aye10032/loss_metric', 'Bekhouche/NED', 'BucketHeadP65/confusion_matrix', 'BucketHeadP65/roc_curve', 'CZLC/rouge_raw', 'DaliaCaRo/accents_unplugged_eval', 'DarrenChensformer/action_generation', 'DarrenChensformer/eval_keyphrase', 'DarrenChensformer/relation_extraction', 'DemAI-Lab-UCF/Sem-nCG

In [53]:
pubmed.info.features

{'pubid': Value(dtype='int32', id=None),
 'question': Value(dtype='string', id=None),
 'context': Sequence(feature={'contexts': Value(dtype='string', id=None), 'labels': Value(dtype='string', id=None), 'meshes': Value(dtype='string', id=None), 'reasoning_required_pred': Value(dtype='string', id=None), 'reasoning_free_pred': Value(dtype='string', id=None)}, length=-1, id=None),
 'long_answer': Value(dtype='string', id=None),
 'final_decision': Value(dtype='string', id=None)}

In [54]:

batches = pubmed.iter(batch_size=1)  # avoid shadowing the built-in iter()
for i in batches:
    print(f"i is {i}") 
    # print()
    print('***********************************************************')    
    print(f"pubmed id is {i['pubid']}, question is {i['question']}")
    print()
    # print()
    # print(f"contexts are {i['context']}")
    print( "len(i['context'])", len(i['context']) )
    for j in i['context']:
        print(f"{j['contexts']}")
        print()
    break

i is {'pubid': [21645374], 'question': ['Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?'], 'context': [{'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.', 'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD

In [55]:
pubmed[0]['context']

{'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.',
  'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondrial dye MitoT