[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/gpl/02-negative-mining.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/gpl/02-negative-mining.ipynb)

To perform the negative mining step, we need a vector database that stores the encoded passages and lets us search for passages that are similar to a query but are not its matching (positive) passage. This requires two things:

* a pre-existing retriever model to build encodings - for this we will use a model from the *sentence-transformers* library
* a vector DB to store encodings - for this we will use Pinecone, as it is a free vector DB that is easy to set up and fast at scale

Let's load the model first.

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-distilbert-base-tas-b')
model.max_seq_length = 256
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

And now initialize a Pinecone index for storing the encoded passage vectors later.

In [2]:
# this cell uses the legacy client interface (pinecone.init), which was removed in pinecone-client 3.x
import pinecone  # pip install "pinecone-client<3"

index_name = "negative-mining"

pinecone.init(
    api_key="YOUR_API_KEY",  # app.pinecone.io
    environment="YOUR_ENV"  # find next to API key in console
)
# create a new negative mining index if it does not already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='dotproduct',
        pods=1
    )
# connect
index = pinecone.Index(index_name)

Now we encode the passages and store them in the `negative-mining` index.

In [3]:
from tqdm.auto import tqdm

def get_text():
    with open('data/pairs.tsv', 'r', encoding='utf-8') as fp:
        lines = fp.read().split('\n')
    for line in tqdm(lines):
        try:
            query, passage = line.split('\t')
            yield query, passage
        except ValueError:
            # skip malformed rows (anything other than exactly two tab-separated fields)
            pass
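To see which rows the `try`/`except ValueError` above actually drops, here is a minimal offline sketch of the same parsing logic run against an in-memory TSV. The helper name `iter_pairs` and the sample rows are hypothetical and not part of the original notebook. Unlike `get_text`, this version streams line by line rather than reading the whole file first.

```python
import io

def iter_pairs(fp):
    """Yield (query, passage) from a TSV stream, skipping malformed rows."""
    for line in fp:
        parts = line.rstrip("\n").split("\t")
        # rows with anything other than exactly two fields would raise
        # ValueError on unpacking in get_text, so they are skipped here too
        if len(parts) != 2:
            continue
        yield parts[0], parts[1]

# hypothetical sample: two valid rows, three malformed ones
sample = io.StringIO(
    "what is gpl\tGPL is a domain adaptation method\n"
    "row with no tab\n"
    "a\tb\tc\n"
    "\n"
    "second query\tsecond passage\n"
)
parsed = list(iter_pairs(sample))
print(parsed)
```

Only the first and last rows survive; the row without a tab, the row with three fields, and the empty row are all skipped.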

In [4]:
pair_gen = get_text()

pairs = []
to_upsert = []
passage_batch = []
id_batch = []
batch_size = 64

for i, (query, passage) in enumerate(pair_gen):
    pairs.append((query, passage))
    # we do this to avoid passage duplication within a batch
    if passage not in passage_batch:
        passage_batch.append(passage)
        id_batch.append(str(i))
    # on reaching batch_size, we encode and upsert
    if len(passage_batch) == batch_size:
        embeds = model.encode(passage_batch).tolist()
        # upload to index
        index.upsert(vectors=list(zip(id_batch, embeds)))
        # refresh batches
        passage_batch = []
        id_batch = []

# encode and upsert the final, partially filled batch (if any)
if passage_batch:
    embeds = model.encode(passage_batch).tolist()
    index.upsert(vectors=list(zip(id_batch, embeds)))

# check number of vectors in the index
index.describe_index_stats()


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 67840}}}
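Note that the `passage not in passage_batch` check only compares against the current batch, so a passage that reappears in a later batch is stored again under a new ID. A minimal sketch of deduplicating across the whole dataset before batching is shown below, assuming all pairs fit in memory. The helper names `dedupe_passages` and `batched` and the toy pairs are hypothetical. Each passage keeps the index of the first pair that contains it, so an ID can still be used to look the passage up in `pairs`.

```python
def dedupe_passages(pairs):
    """Map each distinct passage to the index (as a string ID) of its first pair."""
    first_seen = {}
    for i, (_, passage) in enumerate(pairs):
        # setdefault keeps the ID of the first occurrence only
        first_seen.setdefault(passage, str(i))
    return first_seen

def batched(items, size):
    """Yield consecutive slices of at most `size` items, including the last partial one."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# hypothetical toy data: "p1" and "p2" each appear twice
toy_pairs = [("q1", "p1"), ("q2", "p2"), ("q3", "p1"), ("q4", "p3"), ("q5", "p2")]
unique = dedupe_passages(toy_pairs)
batches = list(batched(list(unique.items()), 2))
print(batches)
```

Each batch is a list of `(passage, id)` tuples; in the real pipeline the passages would be encoded and upserted together with their IDs, one batch at a time.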

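The database is now set up for the *negative mining* step. We loop through the queries in `pairs` in batches, retrieve the *10* most similar passages for each query, and keep one randomly chosen passage that differs from the query's positive passage as its negative.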

In [5]:
import random

batch_size = 100
triplets = []

for i in tqdm(range(0, len(pairs), batch_size)):
    # embed queries and query pinecone in batches to minimize network latency
    i_end = min(i+batch_size, len(pairs))
    queries = [pair[0] for pair in pairs[i:i_end]]
    pos_passages = [pair[1] for pair in pairs[i:i_end]]
    # create query embeddings
    query_embs = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
    # search for top_k most similar passages, sending the whole batch via the legacy `queries` argument
    res = index.query(queries=query_embs.tolist(), top_k=10)
    # iterate through queries and find negatives
    for query, pos_passage, query_res in zip(queries, pos_passages, res['results']):
        top_results = query_res['matches']
        # shuffle results so they are in random order
        random.shuffle(top_results)
        for hit in top_results:
            neg_passage = pairs[int(hit['id'])][1]
            # check that we're not just returning the positive passage
            if neg_passage != pos_passage:
                # if not we can add this to our (Q, P+, P-) triplets
                triplets.append(query+'\t'+pos_passage+'\t'+neg_passage)
                break

with open('data/triplets.tsv', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(triplets))
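The selection logic in the loop above can be tried offline without a model or a Pinecone index. The following is a toy sketch only: the 2-dimensional vectors, the passage texts, and the helper names `dot`, `top_k_ids`, and `mine_negative` are all made up for illustration, and a brute-force dot-product ranking stands in for the vector DB query.

```python
import random

def dot(u, v):
    """Dot-product similarity, matching the 'dotproduct' metric of the index."""
    return sum(a * b for a, b in zip(u, v))

def top_k_ids(query_vec, passage_vecs, k):
    """Return the IDs of the k passages with the highest dot-product score."""
    ranked = sorted(
        passage_vecs,
        key=lambda pid: dot(query_vec, passage_vecs[pid]),
        reverse=True,
    )
    return ranked[:k]

def mine_negative(query_vec, pos_passage, passage_vecs, passages, k=10):
    """Pick a random top-k passage that is not the positive; None if there is none."""
    hits = top_k_ids(query_vec, passage_vecs, k)
    random.shuffle(hits)  # random order, as in the notebook loop
    for pid in hits:
        if passages[pid] != pos_passage:
            return passages[pid]
    return None

# hypothetical toy corpus with hand-written 2-d "embeddings"
passages = {"0": "cats purr", "1": "dogs bark", "2": "kittens purr softly", "3": "stock prices fell"}
passage_vecs = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [0.9, 0.1], "3": [-1.0, -1.0]}
query_vec = [1.0, 0.1]

negative = mine_negative(query_vec, "cats purr", passage_vecs, passages, k=2)
print(negative)  # kittens purr softly
```

With `k=2` the top hits are the positive passage and one close neighbour, so the neighbour is always returned regardless of the shuffle. With a larger `k` the shuffle decides which of the remaining hits becomes the negative. If every hit equals the positive passage, the function returns `None`, which corresponds to the notebook loop writing no triplet for that query.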


In [6]:
pinecone.delete_index(index_name)  # delete the index when done to avoid higher charges (if using multiple pods)

With that, every mined query now has both a positive and a negative passage, stored as *(query, positive passage, negative passage)* triplets. The next step in GPL will see us scoring these *(query, passage)* pairs using a cross-encoder model.
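As a bridge to that step, here is a minimal sketch of turning triplet lines back into individual *(query, passage)* pairs, assuming the scoring step takes one pair at a time. The helper name `triplets_to_pairs` and the sample lines are hypothetical; in practice the lines would be read from `data/triplets.tsv`.

```python
def triplets_to_pairs(lines):
    """Expand 'query<TAB>positive<TAB>negative' lines into (query, passage) pairs."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # skip rows that do not have exactly three tab-separated fields
        if len(parts) != 3:
            continue
        query, pos_passage, neg_passage = parts
        pairs.append((query, pos_passage))
        pairs.append((query, neg_passage))
    return pairs

# hypothetical sample lines in the same format as data/triplets.tsv
sample = ["q1\tpos1\tneg1", "q2\tpos2\tneg2", "broken row"]
scoring_pairs = triplets_to_pairs(sample)
print(scoring_pairs)
```

Each valid triplet yields two pairs, the positive one first, so the two scores for a query can later be matched up by position.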