[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/openai/beyond_search_webinar/01_index-init.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/openai/beyond_search_webinar/01_index-init.ipynb)

# Index Init

We use this notebook to push context data to Pinecone. Before running it, the `data/curie_embeddings.parquet` file must already have been created.

In [2]:
import pandas as pd

df = pd.read_parquet('data/curie_embeddings.parquet')
df.head()

Unnamed: 0,docs,category,thread,href,question,context,marked,text,n_tokens,embeddings
0,huggingface,Beginners,Can’t download (some) models although they are...,https://discuss.huggingface.co/t/cant-download...,"Can’t download (some) models to pytorch, altho...",Looking at umarayub/t5-small-finetuned-xsum at...,0,Topic: huggingface - Beginners; Question: Can’...,550,"[0.004923707339912653, -0.016777075827121735, ..."
1,huggingface,Beginners,"Trainer.push_to_hub is taking lot of time, is ...",https://discuss.huggingface.co/t/trainer-push-...,"Hi, I’m trying to push my model to HF hub via ...",@sgugger can you please help me out with this...,0,Topic: huggingface - Beginners; Question: Trai...,204,"[0.0020476023200899363, -0.0010360622545704246..."
2,huggingface,Beginners,SSLCertVerificationError when loading a model,https://discuss.huggingface.co/t/sslcertverifi...,I am exploring potential opportunities of usin...,I’m also getting the same error. Please let me...,0,Topic: huggingface - Beginners; Question: SSLC...,494,"[0.002923486055806279, 0.007949204184114933, 0..."
3,huggingface,Beginners,How to use embeddings to compute similarity?,https://discuss.huggingface.co/t/how-to-use-em...,"Hi, I would like to compute sentence similarit...","With transformers, the feature-extraction pipe...",0,Topic: huggingface - Beginners; Question: How ...,351,"[-0.011044162325561047, 0.0021849798504263163,..."
4,huggingface,Beginners,How to use additional input features for NER?,https://discuss.huggingface.co/t/how-to-use-ad...,"Hello,\nI’ve been following the documentation ...","mhl:\n\ne.g [“Arizona_NNP”, “Ice_NNP”, “Tea_NN...",0,Topic: huggingface - Beginners; Question: How ...,1718,"[0.002879042411223054, -0.004730842541903257, ..."


The maximum metadata size in Pinecone is 5KB, so let's check the *text* field to see whether any items exceed it.

In [12]:
from sys import getsizeof

too_big = []

for text in df['text'].tolist():
    if getsizeof(text) > 5000:
        too_big.append((text, getsizeof(text)))

print(f"{len(too_big)} / {len(df)} records are too big")

1047 / 5957 records are too big
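
Note that `sys.getsizeof` reports the in-memory size of the Python string object, which includes interpreter overhead and depends on how the string is stored internally, so the count above is an approximation. A minimal sketch of a stricter check follows, assuming the limit applies to the UTF-8 encoded bytes (an assumption, not something stated in this notebook). The names `utf8_size`, `find_too_big`, and `LIMIT_BYTES` are hypothetical, and the sample strings are toy data.

```python
LIMIT_BYTES = 5000  # same threshold as the check above

def utf8_size(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def find_too_big(texts, limit=LIMIT_BYTES):
    """Return (position, size) pairs for texts whose UTF-8 size exceeds the limit."""
    return [(i, utf8_size(t)) for i, t in enumerate(texts) if utf8_size(t) > limit]

# toy data, not taken from the real dataset
sample = ["short text", "x" * 6000, "Can’t download"]
print(find_too_big(sample))  # [(1, 6000)]
```

With the real data this could be called as `find_too_big(df['text'].tolist())`; the resulting count may differ from the `getsizeof` figure.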


Unfortunately there are plenty, so we will not store the text itself as metadata. Instead, we keep a mapping from the IDs retrieved from Pinecone back to the original text. We can do this by assigning a unique ID to each text item and storing the ID-to-text mapping with the Streamlit app.

In [6]:
df['id'] = [str(i) for i in range(len(df))]
df.head()

Unnamed: 0,docs,category,thread,href,question,context,marked,text,n_tokens,embeddings,id
0,huggingface,Beginners,Can’t download (some) models although they are...,https://discuss.huggingface.co/t/cant-download...,"Can’t download (some) models to pytorch, altho...",Looking at umarayub/t5-small-finetuned-xsum at...,0,Topic: huggingface - Beginners; Question: Can’...,550,"[0.004923707339912653, -0.016777075827121735, ...",0
1,huggingface,Beginners,"Trainer.push_to_hub is taking lot of time, is ...",https://discuss.huggingface.co/t/trainer-push-...,"Hi, I’m trying to push my model to HF hub via ...",@sgugger can you please help me out with this...,0,Topic: huggingface - Beginners; Question: Trai...,204,"[0.0020476023200899363, -0.0010360622545704246...",1
2,huggingface,Beginners,SSLCertVerificationError when loading a model,https://discuss.huggingface.co/t/sslcertverifi...,I am exploring potential opportunities of usin...,I’m also getting the same error. Please let me...,0,Topic: huggingface - Beginners; Question: SSLC...,494,"[0.002923486055806279, 0.007949204184114933, 0...",2
3,huggingface,Beginners,How to use embeddings to compute similarity?,https://discuss.huggingface.co/t/how-to-use-em...,"Hi, I would like to compute sentence similarit...","With transformers, the feature-extraction pipe...",0,Topic: huggingface - Beginners; Question: How ...,351,"[-0.011044162325561047, 0.0021849798504263163,...",3
4,huggingface,Beginners,How to use additional input features for NER?,https://discuss.huggingface.co/t/how-to-use-ad...,"Hello,\nI’ve been following the documentation ...","mhl:\n\ne.g [“Arizona_NNP”, “Ice_NNP”, “Tea_NN...",0,Topic: huggingface - Beginners; Question: How ...,1718,"[0.002879042411223054, -0.004730842541903257, ...",4


Now let's populate the Pinecone index.

In [24]:
from pinecone import Pinecone, PodSpec

# create the client; the API key is found at app.pinecone.io
pc = Pinecone(api_key='PINECONE_API_KEY')

index_name = 'beyond-search-openai'

# create the index only if it does not already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=len(df['embeddings'].tolist()[0]),
        metric='cosine',
        spec=PodSpec(environment="YOUR_ENV")  # find next to API key in console
    )

index = pc.Index(index_name)
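
The index dimension above is read from the first embedding only. As a hedged sanity check, the sketch below confirms that every embedding has the same length before the index is created. `embedding_dimension` is a hypothetical helper and the vectors shown are toy data.

```python
def embedding_dimension(embeddings):
    """Return the shared length of all embeddings, or raise if they differ."""
    dims = {len(e) for e in embeddings}
    if len(dims) != 1:
        raise ValueError(f"expected exactly one embedding size, found {sorted(dims)}")
    return dims.pop()

# toy vectors standing in for df['embeddings'].tolist()
toy = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(embedding_dimension(toy))  # 3
```

With the real data, `embedding_dimension(df['embeddings'].tolist())` could replace the `len(...)` expression passed as `dimension`.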

We will populate the index in batches, including relevant metadata: *docs*, *category*, *thread*, *href*, and *n_tokens*.

In [25]:
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(df), batch_size)):
    # find the end of the current batch (the last batch may be shorter)
    i_end = min(i+batch_size, len(df))
    df_slice = df.iloc[i:i_end]
    # build (id, vector, metadata) tuples for this batch
    to_upsert = [
        (
            row['id'],
            row['embeddings'].tolist(),
            {
                'docs': row['docs'],
                'category': row['category'],
                'thread': row['thread'],
                'href': row['href'],
                'n_tokens': int(row['n_tokens'])  # plain int for serialization
            }
        ) for _, row in df_slice.iterrows()
    ]
    index.upsert(vectors=to_upsert)

100%|██████████| 187/187 [08:36<00:00,  2.76s/it]
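
The batch arithmetic used in the loop can be checked in isolation. The sketch below reproduces the start and end offsets for 5957 records with a batch size of 32, which accounts for the 187 iterations in the progress bar. `batch_bounds` is a hypothetical helper name, not part of the notebook.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) pairs covering range(n_rows) in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

bounds = list(batch_bounds(5957, 32))
print(len(bounds))   # 187
print(bounds[-1])    # (5952, 5957)
```

The final batch holds only 5 records, which is why the end offset is clamped with `min`.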


We'll also save a second dataset to file containing just the ID -> text mappings.

In [20]:
mappings = dict(zip(df['id'], df['text']))

In [22]:
import json

with open('data/mapping.json', 'w') as fp:
    json.dump(mappings, fp)
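
On the consuming side, the saved file can be read back to turn IDs returned by a Pinecone query into the original text. The following is a minimal sketch of that lookup, not the Streamlit app's actual code. `load_mapping` and `texts_for_ids` are hypothetical helpers, and a temporary file with toy entries stands in for `data/mapping.json`.

```python
import json
import os
import tempfile

def load_mapping(path):
    """Load the ID -> text mapping saved as JSON."""
    with open(path) as fp:
        return json.load(fp)

def texts_for_ids(mapping, ids):
    """Return the stored text for each ID, skipping IDs missing from the mapping."""
    return [mapping[i] for i in ids if i in mapping]

# toy stand-in for data/mapping.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mapping.json")
    with open(path, "w") as fp:
        json.dump({"0": "first text", "1": "second text"}, fp)
    mapping = load_mapping(path)

print(texts_for_ids(mapping, ["1", "7", "0"]))  # ['second text', 'first text']
```

IDs are kept as strings throughout, matching the string IDs assigned to the DataFrame earlier.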

---