[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/training-with-wandb/03-query.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/training-with-wandb/03-query.ipynb)

In [None]:
!pip install -qq wandb datasets pinecone-client sentence-transformers transformers

## Encoder Training

This is part *four* of a four-part notebook series on fine-tuning encoder models with Weights & Biases for use with Pinecone. Find the [full set of notebooks on GitHub here](https://github.com/pinecone-io/examples/blob/master/analytics-and-ml/model-training/training-with-wandb).

We start by initializing our connection to Pinecone. If you followed the previous notebooks, you should already have an [API key from here](https://app.pinecone.io).

In [1]:
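# import the module itself: the calls below use the module-level API
# (pinecone.init, pinecone.Index, pinecone.delete_index)
import pinecone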

pinecone.init(
    api_key='YOUR_API_KEY',  # app.pinecone.io
    environment='YOUR_ENV'  # find next to api key in console
)

index_id = 'arxiv-search'

# connect to index
index = pinecone.Index(index_id)

We load the model from W&B as before...

In [2]:
import wandb

run = wandb.init(project="arxiv-searching")

artifact = run.use_artifact(
    'jamesbriggs/arxiv-searching/minilm-arxiv:latest', type='model'
)
artifact_dir = artifact.download()
artifact_dir

[34m[1mwandb[0m: Currently logged in as: [33mjamesbriggs[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Downloading large artifact minilm-arxiv:latest, 128.23MB. 6 files... 
[34m[1mwandb[0m:   6 of 6 files downloaded.  
Done. 0:0:0.1


'./artifacts/minilm-arxiv:v1'

Initialize it as a sentence transformer...

In [6]:
from sentence_transformers import models, SentenceTransformer
import torch

minilm = models.Transformer(artifact_dir)
pooling = models.Pooling(
    minilm.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(
    modules=[minilm, pooling],
    # 'cuda' picks the default GPU; 'cuda:1' fails on single-GPU machines
    device='cuda' if torch.cuda.is_available() else 'cpu'
)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
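The pooling module above (`pooling_mode_mean_tokens=True`) turns the per-token embeddings into a single sentence vector by averaging them. As a rough illustration of that idea, here is a minimal plain-Python sketch of mask-aware mean pooling. The function name `mean_pool` and the toy vectors are made up for this example; the library does the same kind of averaging on tensors, in batches.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sequence, skipping padding positions.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask:   list of 1 (real token) / 0 (padding), same length
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding sequence to avoid dividing by zero.
    n_tokens = max(sum(attention_mask), 1)
    return [s / n_tokens for s in summed]


# Toy input (not real model output): three token vectors of dimension 2,
# where the last position is padding and must not affect the average.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padded position is ignored, so the result is the average of the first two vectors only.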

Let's create an initial query and matching context.

In [8]:
query = "Ginzburg Landau theory for d-wave pairing and fourfold symmetric vortex\n  core structure"

xq = model.encode(query)

In [7]:
c = "  The Ginzburg Landau theory for d_{x^2-y^2}-wave superconductors is\nconstructed, by starting from the Gor'kov equation with including correction\nterms up to the next order of ln(T_c/T). Some of the non-local correction terms\nare found to break the cylindrical symmetry and lead to the fourfold symmetric\ncore structure, reflecting the internal degree of freedom in the pair\npotential. Using this extended Ginzburg Landau theory, we investigate the\nfourfold symmetric structure of the pair potential, current and magnetic field\naround an isolated single vortex, and clarify concretely how the vortex core\nstructure deviates from the cylindrical symmetry in the d_{x^2-y^2}-wave\nsuperconductors.\n"

In [9]:
cc = model.encode(c)

Let's do a quick sanity check:

In [10]:
from sentence_transformers.util import cos_sim

In [11]:
cos_sim(cc, xq)

tensor([[0.6709]])

In [12]:
r = "something completely random that has nothing to do with the query"
rr = model.encode(r)

In [13]:
cos_sim(rr, xq)

tensor([[-0.0784]])
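For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) through 0 (unrelated) to 1 (same direction). A minimal plain-Python sketch of that formula follows; `cosine_similarity` and the toy vectors are illustrative only and are not the embeddings computed above.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen to show the three reference points of the scale.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector (as in the first toy pair) does not change it.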

Looks good: the cosine similarity for the matching pair is high (0.67), whereas for the completely random sentence it is close to zero (-0.08). Now let's begin performing actual queries:

In [15]:
xc = index.query(vector=xq.tolist(), top_k=5, include_metadata=True)
xc

{'matches': [{'id': '1304.4032',
              'metadata': {'abstract': '  A procedure to derive the '
                                       'Ginzburg-Landau (GL) theory from the '
                                       'multiband BCS\n'
                                       'Hamiltonian is developed in a general '
                                       'case with an arbitrary number of '
                                       'bands\n'
                                       'and arbitrary interaction matrix. It '
                                       "combines the standard Gor'kov "
                                       'truncation\n'
                                       'and a subsequent reconstruction in '
                                       'order to match accuracies of the '
                                       'obtained\n'
                                       'terms. This reconstruction recovers '
                                       'the phenomenological GL theory

We get a perfect result, with the top item being the specific abstract for the given title. Let's try a few more queries...

In [17]:
query = "classifier for genre categorization of a game"

xq = model.encode(query).tolist()

xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '2105.05674',
              'metadata': {'abstract': '  Game developers benefit from '
                                       'availability of custom game genres '
                                       'when doing\n'
                                       'game market analysis. This information '
                                       'can help them to spot opportunities '
                                       'in\n'
                                       'market and make them more successful '
                                       'in planning a new game. In this paper '
                                       'we\n'
                                       'find good classifier for predicting '
                                       'category of a game. Prediction is '
                                       'based on\n'
                                       'description and title of a game. We '
                                       'use 2443 iOS App Store games

Another great result. Let's try one more...

In [19]:
query = "the best pre-training approaches for multi-modal machine learning"

xq = model.encode(query).tolist()

xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '2206.05555',
              'metadata': {'abstract': '  Multi-modal pre-training and '
                                       'knowledge discovery are two important '
                                       'research\n'
                                       'topics in multi-modal machine '
                                       'learning. Nevertheless, none of '
                                       'existing works\n'
                                       'make attempts to link knowledge '
                                       'discovery with knowledge guided '
                                       'multi-modal\n'
                                       'pre-training. In this paper, we '
                                       'propose to unify them into a '
                                       'continuous\n'
                                       'learning framework for mutual '
                                       'improvement. Taking the open-domain '
     

In this case, the best result appears to be in position *5* (still not bad out of 2M+ abstracts). The top result seems relevant but might not quite answer our query.
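When we already know which paper we expect, reading the position off the printed response by eye gets tedious. A small helper can report the rank instead. This is a sketch under assumptions: `rank_of` is a hypothetical name, the matches below are placeholders rather than real results, and it assumes each match can be indexed like the dictionaries printed above (for example by passing `xc['matches']`).

```python
def rank_of(matches, target_id):
    """Return the 1-based position of target_id in matches, or None if absent."""
    for position, match in enumerate(matches, start=1):
        if match["id"] == target_id:
            return position
    return None


# Placeholder matches for illustration only; the ids and scores are made up.
matches = [
    {"id": "paper-a", "score": 0.71},
    {"id": "paper-b", "score": 0.69},
    {"id": "paper-c", "score": 0.66},
    {"id": "paper-d", "score": 0.64},
    {"id": "paper-e", "score": 0.63},
]

print(rank_of(matches, "paper-e"))  # 5
print(rank_of(matches, "paper-z"))  # None
```

A return value of `None` means the expected paper is not in the top `top_k` results, which is a hint to raise `top_k` or rephrase the query.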

Once we're finished, we delete the index to save resources.

In [None]:
pinecone.delete_index(index_id)

---