[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/training-with-wandb/02-encode.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/analytics-and-ml/model-training/training-with-wandb/02-encode.ipynb)

In [None]:
!pip install -qq wandb datasets pinecone-client sentence-transformers transformers

## Encoding arXiv Abstracts

This is part *three* of a four-part notebook series on fine-tuning encoder models with Weights & Biases for use with Pinecone. Find the [full set of notebooks on GitHub here](https://github.com/pinecone-io/examples/blob/master/analytics-and-ml/model-training/training-with-wandb).

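We start by loading the `arxiv-papers` dataset artifact from W&B, which was created in the very first [W&B notebook](https://github.com/pinecone-io/examples/blob/master/analytics-and-ml/model-training/training-with-wandb/00-intro-and-summarizer-train.ipynb). The file is large, so we read it line by line with a generator rather than loading it into memory.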

In [1]:
import wandb
import json

run = wandb.init(project="arxiv-searching")
# download
artifact = run.use_artifact('events/arxiv-searching/arxiv-papers:latest', type='dataset')
artifact_dir = artifact.download()

# open file generator
path = artifact_dir+'/arxiv-snapshot'
def arxiv_metadata():
    with open(path, 'r') as f:
        for line in f:
            doc_dict = json.loads(line)
            yield doc_dict
metadata = arxiv_metadata()
# get count of items
count = 0
for row in metadata:
    count += 1
# refresh generator
metadata = arxiv_metadata()
print(count)
print(row)

[34m[1mwandb[0m: Currently logged in as: [33mjamesbriggs[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Downloading large artifact arxiv-papers:latest, 3388.22MB. 1 files... 
[34m[1mwandb[0m:   1 of 1 files downloaded.  
Done. 0:0:0.1


2151137
{'id': 'supr-con/9609004', 'submitter': 'Masanori Ichioka', 'authors': 'Naoki Enomoto, Masanori Ichioka and Kazushige Machida (Okayama Univ.)', 'title': 'Ginzburg Landau theory for d-wave pairing and fourfold symmetric vortex\n  core structure', 'comments': '12 pages including 8 eps figs, LaTeX with jpsj.sty & epsfig', 'journal-ref': 'J. Phys. Soc. Jpn. 66, 204 (1997).', 'doi': '10.1143/JPSJ.66.204', 'report-no': None, 'categories': 'supr-con cond-mat.supr-con', 'license': None, 'abstract': "  The Ginzburg Landau theory for d_{x^2-y^2}-wave superconductors is\nconstructed, by starting from the Gor'kov equation with including correction\nterms up to the next order of ln(T_c/T). Some of the non-local correction terms\nare found to break the cylindrical symmetry and lead to the fourfold symmetric\ncore structure, reflecting the internal degree of freedom in the pair\npotential. Using this extended Ginzburg Landau theory, we investigate the\nfourfold symmetric structure of the pair
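The cell above relies on two properties of Python generators: rows are parsed lazily, one line at a time, and a generator is exhausted after one full pass, which is why `metadata` is recreated after counting. A minimal sketch of the same pattern follows, using a hypothetical two-record temporary file in place of the arXiv snapshot (the helper name `iter_jsonl` and the demo records are illustrative assumptions, not part of the original notebook):

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of `path`."""
    with open(path, 'r') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# hypothetical two-record file standing in for the arXiv snapshot
records = [
    {'id': 'demo/0001', 'abstract': 'first abstract'},
    {'id': 'demo/0002', 'abstract': 'second abstract'},
]
with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + '\n')
    demo_path = tmp.name

# counting consumes the generator without keeping rows in memory
demo_count = sum(1 for _ in iter_jsonl(demo_path))

# a fresh generator is needed to read the rows again
gen = iter_jsonl(demo_path)
first_row = next(gen)
gen.close()  # closes the underlying file before we delete it
os.remove(demo_path)

print(demo_count, first_row['id'])
```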

We will encode all of the `'abstract'` values with the `minilm-arxiv-encoder` model we previously trained and stored as an artifact on W&B.

First download the artifact files:

In [2]:
artifact = run.use_artifact(
    'jamesbriggs/arxiv-searching/minilm-arxiv:latest', type='model'
)
artifact_dir = artifact.download()
artifact_dir

[34m[1mwandb[0m: Downloading large artifact minilm-arxiv:latest, 128.23MB. 6 files... 
[34m[1mwandb[0m:   6 of 6 files downloaded.  
Done. 0:0:0.1


'./artifacts/minilm-arxiv:v1'

This directory contains all of the model files needed to initialize our fine-tuned sentence transformer:

In [3]:
import os

os.listdir(artifact_dir)

['vocab.txt',
 'tokenizer.json',
 'pytorch_model.bin',
 'special_tokens_map.json',
 'tokenizer_config.json',
 'config.json']

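We initialize the model from these files by combining a transformer module with a mean pooling layer, which averages the token embeddings into a single sentence embedding: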

In [4]:
from sentence_transformers import models, SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

minilm = models.Transformer(artifact_dir)
pooling = models.Pooling(
    minilm.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

model = SentenceTransformer(
    modules=[minilm, pooling],
    device=device
)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
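The `Pooling` module above is configured with `pooling_mode_mean_tokens=True`, meaning the 384-dimensional sentence embedding is the average of the token embeddings, with padding positions excluded. The following is a simplified pure-Python sketch of that idea for a single sequence; it is not the library's implementation, and the function name `mean_pool` and the toy vectors are assumptions made for illustration:

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1; ignore padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError('attention_mask selects no tokens')
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# toy input: three token positions, the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Because the padded position is masked out, its large values do not affect the result.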

In [5]:
model.encode([row['abstract']])

array([[-1.78127125e-01,  2.79968888e-01, -1.10255487e-01,
         6.38929894e-03, -2.01225877e-01,  1.51545078e-01,
         1.00171259e-02,  3.63548957e-02, -1.11633286e-01,
        -1.86091021e-01, -1.80149123e-01,  1.60711005e-01,
         1.74418956e-01,  4.35423665e-02,  2.56152838e-01,
         1.36271194e-01, -2.20926031e-01,  1.72951341e-01,
         1.11775815e-01,  3.24884406e-03, -3.29344273e-02,
         7.01274052e-02,  5.62849566e-02,  5.84686697e-02,
         6.33804947e-02,  1.59427132e-02,  1.90770373e-01,
         3.12118349e-03,  1.37064531e-01, -5.43970279e-02,
         4.25677076e-02,  1.51187047e-01, -4.71253663e-01,
        -1.16020828e-01, -1.14065185e-01, -1.56056330e-01,
         1.60679163e-03,  2.31568962e-02, -3.18167359e-02,
        -1.16474792e-01, -5.54910935e-02,  2.14383662e-01,
         3.19849402e-02, -5.36291003e-02, -5.07795922e-02,
         1.01412281e-01,  1.14509769e-01,  1.15304410e-01,
        -2.12687105e-02,  1.00387290e-01,  7.25929290e-0

We must encode and then `upsert` our encoded vectors to Pinecone. For this we need to initialize a Pinecone index. First we connect to Pinecone using a [free API key](https://app.pinecone.io).

In [6]:
from pinecone import Pinecone, PodSpec

# create a client instance (pinecone-client v3+ API)
pc = Pinecone(
    api_key='YOUR_API_KEY'  # app.pinecone.io
)
env = 'YOUR_ENV'  # find next to API key in console

In [7]:
index_id = 'arxiv-search'

# create index if it doesn't exist
if index_id not in pc.list_indexes().names():
    pc.create_index(
        name=index_id,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine',
        spec=PodSpec(environment=env, pod_type='s1.x1')
    )

# connect to index
index = pc.Index(index_id)
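The index above is created with `metric='cosine'`, so query results are ranked by the cosine of the angle between the query vector and each stored vector. Pinecone computes this on the server; the sketch below only illustrates the quantity being measured, and the function name and toy vectors are assumptions for the example:

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# vectors pointing the same way score close to 1, regardless of length
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# perpendicular vectors score 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

Since only direction matters, the embeddings do not need to be normalized before upserting when this metric is used.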

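Now we encode the abstracts in batches and upsert each batch to the Pinecone index, storing the abstract text as metadata alongside each vector: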

In [None]:
from tqdm.auto import tqdm

batch_size = 90

def encode_and_upsert(batch):
    # encode abstracts, then upsert (id, vector, metadata) tuples to pinecone
    embeds = model.encode([x['abstract'] for x in batch]).tolist()
    meta = [{'abstract': x['abstract']} for x in batch]
    ids = [x['id'] for x in batch]
    index.upsert(vectors=list(zip(ids, embeds, meta)))

batch = []

for row in tqdm(metadata, total=count):
    batch.append({'id': row['id'], 'abstract': row['abstract']})
    if len(batch) == batch_size:
        encode_and_upsert(batch)
        # reset batch
        batch = []

# add final items if any left
if batch:
    encode_and_upsert(batch)

  1%|          | 13140/2151137 [00:47<2:08:27, 277.38it/s]
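The loop above accumulates rows into batches of 90 and flushes a final partial batch at the end. The same grouping can be expressed as a small generic helper built on `itertools.islice`, which works on any iterable, including generators. This is an alternative sketch rather than the notebook's method; the helper name `batched` is an assumption for the example:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# seven items in batches of three: the last batch is partial
chunks = list(batched(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With a helper like this, the final partial batch needs no separate handling after the loop.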

All that is left after this is to begin making queries. We'll do this in the final notebook, [`03-query.ipynb`](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/projects/training-with-wandb/03-query.ipynb).