[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/hybrid-search/medical-qa/pubmed-splade.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/hybrid-search/medical-qa/pubmed-splade.ipynb)

In [1]:
!pip install -qU datasets transformers sentence-transformers git+https://git@github.com/pinecone-io/pinecone-python-client.git#egg=pinecone-client[grpc] git+https://github.com/naver/splade.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done


## Dataset Preparation

We will use the PubMedQA dataset (`pubmed_qa`, `pqa_labeled` subset) from the Hugging Face Hub.

In [2]:
from datasets import load_dataset

pubmed = load_dataset(
    'pubmed_qa',
    'pqa_labeled',
    split='train'
)
pubmed



Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 1000
})

In [3]:
pubmed[0]['pubid'], pubmed[0]['context']

(21645374,
 {'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.',
   'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondr

We need to cut our contexts into digestible chunks for our models. We'll be using BERT, which has a max sequence length of `512` tokens, *but* typical sentence transformers limit this to `128`.

To be safe and ensure we're not over the smaller `128` token limit we will assume an average token length of `3` characters (in reality it is more like *3-5*) and therefore our required length will be `128*3 == 384` characters.
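As a quick sanity check on that arithmetic, the sketch below computes the character limit and the rough token count a passage of that length would have at 3, 4 and 5 characters per token. The names `max_tokens`, `chars_per_token` and `estimates` are illustrative and not part of the original notebook; the estimates are back-of-the-envelope figures, not tokenizer output.

```python
max_tokens = 128        # the smaller limit quoted above
chars_per_token = 3     # conservative assumption from the text
limit = max_tokens * chars_per_token

# estimated tokens in a `limit`-character passage at different averages
estimates = {cpt: limit / cpt for cpt in (3, 4, 5)}
for cpt, tokens in estimates.items():
    print(f"{cpt} chars/token -> ~{tokens:.0f} tokens")
```

At the more realistic 4 to 5 characters per token, a 384-character passage sits well under the `128` token limit, which is why the 3-character assumption is the safe one.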

To build passages of this length we will define a processing function called `chunker`.

In [4]:
limit = 384

def chunker(contexts: list):
    chunks = []
    all_contexts = ' '.join(contexts).split('.')
    chunk = []
    for context in all_contexts:
        chunk.append(context)
        if len(chunk) >= 3 and len('.'.join(chunk)) > limit:
            # surpassed limit so add to chunks and reset
            chunks.append('.'.join(chunk).strip()+'.')
            # add some overlap between passages
            chunk = chunk[-2:]
    # if we finish and still have a partial chunk, add it
    # (`chunk` is always a list, so test for emptiness rather than None)
    if chunk:
        chunks.append('.'.join(chunk))
    return chunks

In [5]:
chunks = chunker(pubmed[0]['context']['contexts'])
chunks

['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature.',
 'The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants. The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A.',
 'The role of mitochondria during PCD has been recognized in animals; however, it has been less

We need to give each chunk a unique ID, like so:

In [6]:
ids = []
for i in range(len(chunks)):
    ids.append(f"{pubmed[0]['pubid']}-{i}")
ids

['21645374-0',
 '21645374-1',
 '21645374-2',
 '21645374-3',
 '21645374-4',
 '21645374-5',
 '21645374-6']

We create the full contexts dataset with this logic like so:

In [7]:
data = []
for record in pubmed:
    chunks = chunker(record['context']['contexts'])
    for i, context in enumerate(chunks):
        data.append({
            'id': f"{record['pubid']}-{i}",
            'context': context
        })

data[:10]

[{'id': '21645374-0',
  'context': 'Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature.'},
 {'id': '21645374-1',
  'context': 'The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants. The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A.'},
 {'id': '21645374-2',
  '

## Model Initialization and Vectors

With our dataset prepared we can move on to initializing the required models and setting up some helper functions to make sparse and dense vector building easy.

### Dense Vectors

Starting with the dense vectors, we will use an off-the-shelf model from the `sentence-transformers` library.

In [8]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# check device being run on
if device != 'cuda':
    print("==========\n"+
          "WARNING: You are not running on GPU so this may be slow.\n"+
          "If on Google Colab, go to top menu > Runtime > Change "+
          "runtime type > Hardware accelerator > 'GPU' and rerun "+
          "the notebook.\n==========")

dense_model = SentenceTransformer(
    'msmarco-bert-base-dot-v5',
    device=device
)
dense_model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

<small>Note: we are using a model that has not been trained on the PubMed dataset, to emulate a real scenario where you have little-to-no data for a use case. A model like `multi-qa-mpnet-base-dot-v1`, which has seen PubMed before, handles this dataset very well, which makes it harder to see the effect of SPLADE vs. no SPLADE.</small>

We then create an embedding very easily like so:

In [9]:
emb = dense_model.encode(data[0]['context'])
emb.shape

(768,)

The model returns `768`-dimensional dense vectors; this is also reflected in the model attributes.

In [10]:
dim = dense_model.get_sentence_embedding_dimension()
dim

768

### Sparse Vectors

We will also need to create sparse vectors. For that we will be using a learned sparse embedding model called SPLADE. SPLADE actually covers many models that use similar embedding methods; we will be using the `naver/splade-cocondenser-ensembledistil` model.

We initialize the model like so:

In [11]:
from splade.models.transformer_rep import Splade

sparse_model_id = 'naver/splade-cocondenser-ensembledistil'

sparse_model = Splade(sparse_model_id, agg='max')
sparse_model.to(device)  # move to GPU if possible
sparse_model.eval()

The model takes tokenized inputs that are built using a tokenizer initialized with the same model ID.

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(sparse_model_id)

tokens = tokenizer(data[0]['context'], return_tensors='pt')

To create sparse vectors we do:

In [13]:
with torch.no_grad():
    sparse_emb = sparse_model(
        d_kwargs=tokens.to(device)
    )['d_rep'].squeeze()
sparse_emb.shape

torch.Size([30522])

In [14]:
sparse_emb

tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0')

Leaving us with a `30522` dimensional sparse vector embedding. Pinecone will expect a dictionary style format of the sparse vector. To build it we take a couple more steps.

First we get a list of non-zero positions in the vector.

In [15]:
indices = sparse_emb.nonzero().squeeze().cpu().tolist()
print(len(indices))

173


We have `173` non-zero values. We use them to create a dictionary of index positions and scores like so:

In [16]:
values = sparse_emb[indices].cpu().tolist()
sparse = {'indices': indices, 'values': values}
sparse

{'indices': [1000,
  1039,
  1052,
  1997,
  1999,
  2003,
  2024,
  2049,
  2083,
  2094,
  2173,
  2239,
  2278,
  2290,
  2306,
  2331,
  2415,
  2427,
  2523,
  2537,
  2550,
  2565,
  2566,
  2597,
  2644,
  2754,
  2757,
  2832,
  2974,
  3030,
  3081,
  3102,
  3252,
  3269,
  3274,
  3280,
  3370,
  3392,
  3399,
  3508,
  3526,
  3571,
  3581,
  3628,
  3727,
  3740,
  3817,
  3965,
  3968,
  4264,
  4295,
  4372,
  4442,
  4456,
  4574,
  4649,
  4717,
  4730,
  4758,
  4775,
  4870,
  4962,
  4963,
  5080,
  5104,
  5258,
  5397,
  5701,
  5708,
  5920,
  5996,
  6198,
  6210,
  6215,
  6310,
  6418,
  6470,
  6531,
  6546,
  6580,
  6897,
  7053,
  7337,
  7366,
  7403,
  7473,
  7609,
  7691,
  7775,
  7816,
  8475,
  8676,
  8715,
  8761,
  8765,
  8872,
  8979,
  9007,
  9232,
  9448,
  9607,
  9706,
  9890,
  9895,
  9915,
  10012,
  10088,
  10244,
  10267,
  10327,
  10507,
  10708,
  10738,
  11503,
  11568,
  11704,
  11767,
  11798,
  11829,
  11934,
  12222,
  124

This is the format that Pinecone requires, from here we can move on to indexing. We'll take a brief detour to explore the sparse vectors, but if you'd rather jump straight to indexing, skip the next section.
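To make the conversion concrete without needing a GPU or the SPLADE model, here is a minimal pure-Python sketch of the same idea on a toy vector. The helper name `to_sparse_dict` and the toy values are made up for illustration; the notebook itself performs this step with `nonzero()` on a torch tensor.

```python
def to_sparse_dict(vector):
    """Keep only non-zero entries as parallel index/value lists."""
    indices = [i for i, v in enumerate(vector) if v != 0]
    values = [vector[i] for i in indices]
    return {'indices': indices, 'values': values}

# a tiny stand-in for the 30522-dimensional SPLADE output
toy = [0.0, 1.5, 0.0, 0.0, 0.25, 0.0]
sparse_toy = to_sparse_dict(toy)
print(sparse_toy)  # {'indices': [1, 4], 'values': [1.5, 0.25]}
```

The two lists are aligned by position, so `values[k]` is the weight of the vocabulary entry at `indices[k]`.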

---

#### Reading Sparse Embedding

But before moving on to indexing everything in Pinecone, let's take a moment to understand what our sparse vector means. We can translate it into a human-readable format to see what this sparse vector actually represents.

We create a way of mapping from index positions to actual BERT tokenizer tokens.

In [17]:
idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}

Then create the mappings like we did with the Pinecone-friendly sparse format above.

In [18]:
sparse_dict_tokens = {
    idx2token[idx]: round(weight, 2) for idx, weight in zip(indices, values)
}
# sort so we can see most relevant tokens first
sparse_dict_tokens = {
    k: v for k, v in sorted(
        sparse_dict_tokens.items(),
        key=lambda item: item[1],
        reverse=True
    )
}
sparse_dict_tokens

{'pc': 3.02,
 'lace': 2.95,
 'programmed': 2.36,
 '##for': 2.28,
 'madagascar': 2.26,
 'death': 1.96,
 '##d': 1.95,
 'lattice': 1.81,
 'cell': 1.69,
 '##iensis': 1.64,
 'malaga': 1.6,
 '##get': 1.56,
 'regulated': 1.53,
 'die': 1.51,
 'lacey': 1.5,
 '##ono': 1.46,
 '##ole': 1.45,
 '##oles': 1.45,
 '##scu': 1.39,
 'transverse': 1.38,
 'leaves': 1.34,
 'cells': 1.31,
 'longitudinal': 1.31,
 'plant': 1.21,
 'plants': 1.16,
 'leaf': 1.15,
 'ap': 1.14,
 'organism': 1.11,
 'per': 1.1,
 'regulation': 1.03,
 'veins': 1.02,
 'organisms': 1.0,
 '##work': 0.99,
 'are': 0.94,
 'modified': 0.93,
 'controlled': 0.92,
 'dead': 0.9,
 'occur': 0.9,
 'disorder': 0.87,
 'program': 0.82,
 '##lat': 0.81,
 'through': 0.76,
 '##cl': 0.74,
 'computer': 0.71,
 '##ations': 0.7,
 'abbreviation': 0.69,
 'produced': 0.67,
 'is': 0.65,
 'center': 0.63,
 '"': 0.62,
 'produce': 0.62,
 'technology': 0.61,
 'process': 0.6,
 '##osing': 0.59,
 'matt': 0.54,
 'cc': 0.54,
 '##ation': 0.53,
 'outward': 0.53,
 'gage': 0.52,


---

## Indexing Everything

To build the vector DB we will need to index everything, for this we will need to initialize our connection to Pinecone, create an index, and insert everything in the format:

```python
{
    'id': 'id-123',
    'values': [0.1, 0.2, ...],  # dense vec
    'sparse_values': {
        'indices': [23, 718],
        'values': [0.25, 0.77]
    },  # sparse vec
    'metadata': {"context": "some text here"}  # metadata dict
}
```
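Before upserting, it can help to sanity-check that a record matches this layout. The sketch below is one possible check; `check_record` is a hypothetical helper written for illustration, not part of the Pinecone client or the original notebook, and it only verifies the structure shown above.

```python
def check_record(record: dict, dim: int) -> None:
    """Raise ValueError if `record` does not match the layout shown above."""
    for key in ('id', 'values', 'sparse_values'):
        if key not in record:
            raise ValueError(f"missing key: {key}")
    if len(record['values']) != dim:
        raise ValueError("dense vector has the wrong dimension")
    sparse = record['sparse_values']
    if len(sparse['indices']) != len(sparse['values']):
        raise ValueError("sparse indices and values differ in length")

# toy record with a 3-dimensional dense vector (illustrative values)
example = {
    'id': 'id-123',
    'values': [0.1, 0.2, 0.3],
    'sparse_values': {'indices': [23, 718], 'values': [0.25, 0.77]},
    'metadata': {'context': 'some text here'},
}
check_record(example, dim=3)  # passes silently
```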

To make things easier we can create a helper function to transform a list of records from `data` into this format, we'll call it `builder`:

In [19]:
def builder(records: list):
    ids = [x['id'] for x in records]
    contexts = [x['context'] for x in records]
    # create dense vecs
    dense_vecs = dense_model.encode(contexts).tolist()
    # create sparse vecs
    input_ids = tokenizer(
        contexts, return_tensors='pt',
        padding=True, truncation=True
    )
    with torch.no_grad():
        sparse_vecs = sparse_model(
            d_kwargs=input_ids.to(device)
        )['d_rep'].squeeze()
    # convert to upsert format
    upserts = []
    for _id, dense_vec, sparse_vec, context in zip(ids, dense_vecs, sparse_vecs, contexts):
        # extract columns where there are non-zero weights
        indices = sparse_vec.nonzero().squeeze().cpu().tolist()  # positions
        values = sparse_vec[indices].cpu().tolist()  # weights/scores
        # build sparse values dictionary
        sparse_values = {
            "indices": indices,
            "values": values
        }
        # build metadata struct
        metadata = {'context': context}
        # append to the upserts list as a plain dict in the upsert format
        upserts.append({
            'id': _id,
            'values': dense_vec,
            'sparse_values': sparse_values,
            'metadata': metadata
        })
    return upserts

In [20]:
builder(data[:3])

[id: "21645374-0"
 values: -0.0860980972647667
 values: -0.06404605507850647
 values: -0.09067439287900925
 values: -0.13883446156978607
 values: 0.40349075198173523
 values: 0.04510989040136337
 values: 0.17842265963554382
 values: 0.008637930266559124
 values: 0.39867380261421204
 values: -0.12001233547925949
 values: -0.055883314460515976
 values: 0.1040591150522232
 values: -0.5984246730804443
 values: 0.4460744261741638
 values: 0.07607370615005493
 values: 0.718574583530426
 values: 0.13898858428001404
 values: -0.03241853415966034
 values: 0.05966181308031082
 values: 0.05813855305314064
 values: -0.14696815609931946
 values: 0.02058224566280842
 values: 0.7175166606903076
 values: 0.26266899704933167
 values: 0.18689090013504028
 values: -0.27962222695350647
 values: -0.4334171712398529
 values: -0.36501309275627136
 values: -0.4082491993904114
 values: 0.4922325313091278
 values: -0.04993252828717232
 values: -0.3248228430747986
 values: 0.14582324028015137
 values: -0.2137992

Now we initialize our connection to Pinecone using a [free API key](https://app.pinecone.io/).

In [21]:
import pinecone  # module-level client: init, create_index, Index, delete_index

pinecone.init(
    api_key="YOUR_API_KEY",  # app.pinecone.io
    environment="YOUR_ENV"  # next to api key in console
)

Then create a new sparse-dense enabled index (this requires `metric="dotproduct"` and `pod_type` to be `p1` or `s1`).

In [22]:
index_name = 'pubmed-splade'

pinecone.create_index(
    index_name,
    dimension=dim,
    metric="dotproduct",
    pod_type="s1"
)

Initialize the index connection with `Index` (or `GRPCIndex` when using the gRPC client):

In [23]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Upsert to sparse-dense is simple:

In [24]:
index.upsert(builder(data[:3]))

upserted_count: 3

We can repeat this and iterate through (and index) the full dataset:

In [25]:
from tqdm.auto import tqdm

batch_size = 64

for i in tqdm(range(0, len(data), batch_size)):
    # extract batch of data
    i_end = min(i+batch_size, len(data))
    batch = data[i:i_end]
    # pass data to builder and upsert
    index.upsert(builder(batch))

  0%|          | 0/93 [00:00<?, ?it/s]

We can check the number of upserted records aligns with the length of `data`.

In [26]:
len(data), index.describe_index_stats()

(5930, {'dimension': 768,
  'index_fullness': 0.0,
  'namespaces': {'': {'vector_count': 5930}},
  'total_vector_count': 5930})
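The `93` iterations shown in the progress bar follow directly from the record count and batch size. A short worked check, using only the numbers reported in this notebook:

```python
import math

n_records = 5930   # len(data) reported above
batch_size = 64

n_batches = math.ceil(n_records / batch_size)
last_batch = n_records - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 93 42
```

So the loop runs 92 full batches of 64 records and one final batch of 42.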

And now we can move on to querying...

## Queries

Our queries need to contain both dense and sparse vectors, we will define a function `encode` to handle the construction of vectors from text.

In [27]:
def encode(text: str):
    # create dense vec
    dense_vec = dense_model.encode(text).tolist()
    # create sparse vec
    input_ids = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        sparse_vec = sparse_model(
            d_kwargs=input_ids.to(device)
        )['d_rep'].squeeze()
    # convert to dictionary format
    indices = sparse_vec.nonzero().squeeze().cpu().tolist()
    values = sparse_vec[indices].cpu().tolist()
    sparse_dict = {"indices": indices, "values": values}
    # return vecs
    return dense_vec, sparse_dict

In [37]:
query = "Can clinicians use the PHQ-9 to assess depression in people with vision loss?"
dense, sparse = encode(query)
# query
xc = index.query(
    vector=dense,
    sparse_vector=sparse,
    top_k=2,  # how many results to return
    include_metadata=True
)
xc

{'matches': [{'id': '19156007-0',
              'metadata': {'context': 'To investigate whether the Patient '
                                      'Health Questionnaire-9 (PHQ-9) '
                                      'possesses the essential psychometric '
                                      'characteristics to measure depressive '
                                      'symptoms in people with visual '
                                      'impairment. The PHQ-9 scale was '
                                      'completed by 103 participants with low '
                                      'vision. These data were then assessed '
                                      'for fit to the Rasch model. The '
                                      "participants' mean +/- standard "
                                      'deviation (SD) age was 74.7 +/- 12.2 '
                                      'years.'},
              'score': 203.74826,
              'sparse_values': {'indices': [], 'va

We get two answers (`top_k=2`) that talk about **PHQ-9** and **depression**, but only the first result is relevant to our specific scenario of using these items for people with **vision loss**. This is a strong result. However, how much of this is down to the sparse vs. dense components?

We can actually modify the dense and sparse vectors being used to query, so that we _"weight"_ one or the other to be more/less important.

We set a scaling value `alpha` and implement it so that when `alpha == 0` we are doing a pure **sparse** search, and when `alpha == 1` we are doing a pure **dense** search.

In [38]:
def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse

Let's try a pure dense search:

In [39]:
hdense, hsparse = hybrid_scale(dense, sparse, alpha=1.0)
# query
xc = index.query(
    vector=hdense,
    sparse_vector=hsparse,
    top_k=2,  # how many results to return
    include_metadata=True
)
xc

{'matches': [{'id': '19156007-0',
              'metadata': {'context': 'To investigate whether the Patient '
                                      'Health Questionnaire-9 (PHQ-9) '
                                      'possesses the essential psychometric '
                                      'characteristics to measure depressive '
                                      'symptoms in people with visual '
                                      'impairment. The PHQ-9 scale was '
                                      'completed by 103 participants with low '
                                      'vision. These data were then assessed '
                                      'for fit to the Rasch model. The '
                                      "participants' mean +/- standard "
                                      'deviation (SD) age was 74.7 +/- 12.2 '
                                      'years.'},
              'score': 181.90709,
              'sparse_values': {'indices': [], 'va

Dense is actually performing very well here. Let's try a pure sparse search:

In [40]:
hdense, hsparse = hybrid_scale(dense, sparse, alpha=0.0)
# query
xc = index.query(
    vector=hdense,
    sparse_vector=hsparse,
    top_k=2,  # how many results to return
    include_metadata=True
)
xc

{'matches': [{'id': '19156007-0',
              'metadata': {'context': 'To investigate whether the Patient '
                                      'Health Questionnaire-9 (PHQ-9) '
                                      'possesses the essential psychometric '
                                      'characteristics to measure depressive '
                                      'symptoms in people with visual '
                                      'impairment. The PHQ-9 scale was '
                                      'completed by 103 participants with low '
                                      'vision. These data were then assessed '
                                      'for fit to the Rasch model. The '
                                      "participants' mean +/- standard "
                                      'deviation (SD) age was 74.7 +/- 12.2 '
                                      'years.'},
              'score': 21.841171,
              'sparse_values': {'indices': [], 'va

In this scenario, both models return good results. But the results and their respective scores vary as we shift the weighting between sparse and dense.
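The scores reported for match `19156007-0` are consistent with a simple picture: the unweighted hybrid score (`203.74826`) is very close to the pure dense score (`181.90709`) plus the pure sparse score (`21.841171`). The sketch below illustrates that picture on toy vectors, assuming the sparse-dense score is the dense dot product plus the sparse dot product. That assumption matches the numbers above but is not taken from Pinecone's documentation here, and `dot`, `sparse_dot` and `hybrid_score` are illustrative helpers rather than client functions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(a, b):
    """Dot product of two {'indices', 'values'} sparse vectors."""
    lookup = dict(zip(b['indices'], b['values']))
    return sum(v * lookup.get(i, 0.0) for i, v in zip(a['indices'], a['values']))

def hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha):
    # mirrors hybrid_scale: only the query side is scaled by alpha / (1 - alpha)
    return alpha * dot(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)

# toy vectors, made up for illustration
q_dense, d_dense = [1.0, 2.0], [3.0, 4.0]
q_sparse = {'indices': [5, 9], 'values': [2.0, 1.0]}
d_sparse = {'indices': [9, 12], 'values': [4.0, 3.0]}

print(hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha=1.0))  # 11.0 (dense only)
print(hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha=0.0))  # 4.0 (sparse only)

# scores reported above for match 19156007-0
print(abs((181.90709 + 21.841171) - 203.74826) < 1e-4)  # True
```

Under this view, `alpha` simply slides the score between the two dot products, and the dense term dominates the raw scores in this notebook because its magnitude is much larger.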

In [41]:
query = "Does ibuprofen increase perioperative blood loss during hip arthroplasty?"
dense, sparse = encode(query)
hdense, hsparse = hybrid_scale(dense, sparse, alpha=0.0)  # pure SPARSE
# query
xc = index.query(
    vector=hdense,
    sparse_vector=hsparse,
    top_k=2,  # how many results to return
    include_metadata=True
)
xc

{'matches': [{'id': '12442934-0',
              'metadata': {'context': 'To determine whether prior exposure of '
                                      'non-steroidal anti-inflammatory drugs '
                                      'increases perioperative blood loss '
                                      'associated with major orthopaedic '
                                      'surgery. Fifty patients scheduled for '
                                      'total hip replacement were allocated to '
                                      'two groups (double blind, randomized '
                                      'manner). All patients were pretreated '
                                      'for 2 weeks before surgery: Group 1 '
                                      'with placebo drug, Group 2 with '
                                      'ibuprofen. All patients were injected '
                                      'intrathecally with bupivacaine 20mg '
                                 

Here the term `hip arthroplasty` refers to a hip replacement. Using SPLADE we get the answer ranked at #1. Let's try with dense:

In [42]:
query = "Does ibuprofen increase perioperative blood loss during hip arthroplasty?"
dense, sparse = encode(query)
hdense, hsparse = hybrid_scale(dense, sparse, alpha=1.0)  # pure DENSE
# query
xc = index.query(
    vector=hdense,
    sparse_vector=hsparse,
    top_k=2,  # how many results to return
    include_metadata=True
)
xc

{'matches': [{'id': '12442934-3',
              'metadata': {'context': ' The perioperative blood loss increased '
                                      'by 45% in the ibuprofen group compared '
                                      'with placebo. The total (+/-SD) blood '
                                      'loss in the ibuprofen group was 1161 '
                                      '(+/-472) mL versus 796 (+/-337) mL in '
                                      'the placebo group.'},
              'score': 177.40633,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '12442934-0',
              'metadata': {'context': 'To determine whether prior exposure of '
                                      'non-steroidal anti-inflammatory drugs '
                                      'increases perioperative blood loss '
                                      'associated with major orthopaedic '
                                      '

Using dense only we still get the "best" answer at #2, but we return a less relevant answer first. For this question and others, SPLADE can outperform dense models, particularly those that haven't been fine-tuned on the data source.

Naturally, searching with both sparse and dense can give us the best of both worlds. We give importance to keyword matches without losing the semantic meaning component.

Try some more questions!

Here are a few ideas [sourced from the PubMedQA paper](https://aclanthology.org/D19-1259/):

```
Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?

Spontaneous electrocardiogram alterations predict ventricular fibrillation in Brugada syndrome

Do liver grafts from selected older donors have significantly more ischaemia reperfusion injury?

Does reducing spasticity translate into functional benefit?

Does ibuprofen increase perioperative blood loss during hip arthroplasty?

Should circumcision be performed in childhood?

Is external palliative radiotherapy for gallbladder carcinoma effective?

Sternal fracture in growing children: A rare and often overlooked fracture?

Xanthogranulomatous cholecystitis: a premalignant condition?

Can PRISM predict length of PICU stay?

Is trabecular bone related to primary stability of miniscrews?
```

Once you're done, delete the index to save resources.

In [43]:
pinecone.delete_index(index_name)

---