<a href="https://colab.research.google.com/github/karmanandan/vector_databse_examples/blob/main/pinecode_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [49]:
!pip install -qU "pinecone-client[grpc]"==2.2.1 datasets==2.12.0 sentence-transformers==2.2.2

In [50]:
# load a 50,000-row slice (rows 240000 to 289999) of the Quora question-pairs dataset
from datasets import load_dataset

dataset = load_dataset('quora', split='train[240000:290000]')
dataset



Dataset({
    features: ['questions', 'is_duplicate'],
    num_rows: 50000
})

In [51]:
dataset[:5]

{'questions': [{'id': [207550, 351729],
   'text': ['What is the truth of life?', "What's the evil truth of life?"]},
  {'id': [33183, 351730],
   'text': ['Which is the best smartphone under 20K in India?',
    'Which is the best smartphone with in 20k in India?']},
  {'id': [351731, 351732],
   'text': ['Steps taken by Canadian government to improve literacy rate?',
    'Can I send homemade herbal hair oil from India to US via postal or private courier services?']},
  {'id': [37799, 94186],
   'text': ['What is a good way to lose 30 pounds in 2 months?',
    'What can I do to lose 30 pounds in 2 months?']},
  {'id': [351733, 351734],
   'text': ['Which of the following most accurately describes the translation of the graph y = (x+3)^2 -2 to the graph of y = (x -2)^2 +2?',
    'How do you graph x + 2y = -2?']}],
 'is_duplicate': [False, True, False, True, False]}

In [52]:
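# flatten the question pairs into a single list of question strings
questions = []
for qn in dataset['questions']:
    questions.extend(qn['text'])

# remove duplicates (note: set() does not preserve the original order)
questions = list(set(questions))
len(questions)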

88919
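
Because `list(set(...))` does not keep the original order, and string hashing can vary between Python processes, the position of each question (used later as its record ID) may differ from run to run. A minimal sketch of an order-preserving alternative, assuming reproducible IDs are wanted; `dedupe_preserve_order` is a hypothetical helper name, not part of the original notebook:

```python
def dedupe_preserve_order(items):
    """Return items with duplicates removed, keeping first occurrences in order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# toy data standing in for the flattened question list
sample = ['a?', 'b?', 'a?', 'c?', 'b?']
unique = dedupe_preserve_order(sample)
print(unique)  # ['a?', 'b?', 'c?']
```

With this helper, `questions = dedupe_preserve_order(questions)` would give the same ordering on every run for the same dataset slice.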

In [53]:
print('\n'.join(questions[:5]))

Am I eligible for defence quota if my father was in the navy?
How do we know that we're not living in a computer simulation?
How were the IP addresses starting with 172 established?
Why does Quora seemingly always tell me that my questions need improvement when they are clear and concise questions?
Why do the Olympic winners bite their medals?


In [54]:
from sentence_transformers import SentenceTransformer

In [55]:
import torch

In [56]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [57]:
# pass the class itself; SentenceTransformer() would build an empty model just to read its docs
help(SentenceTransformer)

Help on SentenceTransformer in module sentence_transformers.SentenceTransformer object:

class SentenceTransformer(torch.nn.modules.container.Sequential)
 |  SentenceTransformer(model_name_or_path: Optional[str] = None, modules: Optional[Iterable[torch.nn.modules.module.Module]] = None, device: Optional[str] = None, cache_folder: Optional[str] = None, use_auth_token: Union[bool, str, NoneType] = None)
 |  
 |  Loads or create a SentenceTransformer model, that can be used to map sentences / text to embeddings.
 |  
 |  :param model_name_or_path: If it is a filepath on disc, it loads the model from that path. If it is not a path, it first tries to download a pre-trained SentenceTransformer model. If that fails, tries to construct a model from Huggingface models repository with that name.
 |  :param modules: This parameter can be used to create custom SentenceTransformer models from scratch.
 |  :param device: Device (like 'cuda' / 'cpu') that should be used for computation. If None, chec

In [58]:
# canonical model name uses a lowercase "v2"
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [59]:
print("max_seq_length:",model.max_seq_length)

max_seq_length: 256


In [60]:
query = 'who is the writer for animal farm'
query_embed = model.encode(query)
# query_embed

In [61]:
query_embed.shape

(384,)
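
The model printout above ends with a `Normalize()` module, so each 384-dimensional embedding is expected to have unit length. For unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates this with small hand-made vectors in pure Python; `cosine_similarity` and `normalize` are hypothetical helper names, and the 2-d vectors stand in for real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


u = normalize([3.0, 4.0])  # [0.6, 0.8]
w = normalize([4.0, 3.0])  # [0.8, 0.6]

# for unit vectors the dot product and the cosine similarity agree
dot_product = sum(x * y for x, y in zip(u, w))
print(round(cosine_similarity(u, w), 6), round(dot_product, 6))
```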

In [62]:
_id = '0'
meta_data = {'text':query}
vectors = [(_id, query_embed, meta_data)]
vectors

[('0',
  array([-6.12460338e-02, -2.82994285e-02,  1.76846609e-02,  3.87213640e-02,
          1.00621330e-02,  2.29025371e-02, -5.57215251e-02, -4.20217812e-02,
          1.29571920e-02,  1.29865250e-02,  5.03526255e-02,  6.09714799e-02,
         -5.33461608e-02, -6.55398518e-03, -5.59711009e-02,  7.55332857e-02,
         -4.53792438e-02,  6.75516576e-02,  4.66930754e-02, -5.91803342e-02,
         -5.12790494e-02,  3.11999731e-02, -8.13353341e-03,  1.49146840e-03,
         -2.14882735e-02, -5.42149693e-02, -1.61097888e-02, -2.43777335e-02,
         -1.08334802e-01,  2.56306268e-02, -3.06039751e-02,  3.85068506e-02,
          1.11974590e-02,  1.27367089e-02,  4.45901789e-02,  5.56536764e-02,
          1.76930521e-02, -8.36187694e-03,  5.60073704e-02,  1.37353328e-03,
          2.60696504e-02, -9.27120522e-02, -1.48614459e-02, -3.33702890e-03,
         -2.20804345e-02, -3.40280123e-02, -2.18813214e-02, -9.12155211e-02,
          3.02205049e-02, -3.66846398e-02, -3.97559591e-02, -3.490092

In [63]:
len(questions)

88919

In [64]:
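import os
import pinecone

# get api key from app.pinecone.io; read it from the environment instead of
# hard-coding it (the variable name PINECONE_API_KEY is an assumption)
api_key = os.environ.get('PINECONE_API_KEY', '')
# find your environment next to the api key in pinecone console
env = 'us-west1-gcp-free'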

pinecone.init(
    api_key=api_key,
    environment=env
)

index_name = 'semantic-search'

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine'
    )

# now connect to the index
index = pinecone.GRPCIndex(index_name)

In [65]:
from tqdm.auto import tqdm

vector_limit = 80000

questions = questions[:vector_limit]

In [66]:
len(questions)

80000

In [67]:
batch_size = 128

In [69]:
vector_db = []
for i in tqdm(range(0, len(questions), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(questions))
    # create IDs batch (the position in `questions` serves as the record ID)
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in questions[i:i_end]]
    # create embeddings
    xc = model.encode(questions[i:i_end])
    # pair each ID with its embedding and metadata, then upsert to Pinecone
    records = zip(ids, xc, metadatas)
    index.upsert(records)
    # vector_db.append(records)

# check number of records in the index
index.describe_index_stats()

  0%|          | 0/625 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.2,
 'namespaces': {'': {'vector_count': 80000}},
 'total_vector_count': 80000}

In [70]:
# len(vector_db)

0

In [71]:
625*128

80000
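
The 625 steps shown in the progress bar follow from the batch arithmetic: 80000 items split into batches of 128. A small sketch of the same slicing logic as the upsert loop, with a hypothetical helper `batch_bounds`, also covers the case where the item count is not a multiple of the batch size (the last batch is then shorter):

```python
import math


def batch_bounds(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in chunks."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


n_batches = math.ceil(80000 / 128)
bounds = list(batch_bounds(80000, 128))
print(n_batches, len(bounds), bounds[-1])  # 625 625 (79872, 80000)

# the full de-duplicated list (88919 items) would need one extra, shorter batch
uneven = list(batch_bounds(88919, 128))
print(len(uneven), uneven[-1])  # 695 (88832, 88919)
```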

In [76]:
query = 'Is it safe to smoke weed?'
# alternative query to try:
# query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '961',
              'metadata': {'text': 'Is it safe to smoke weed?'},
              'score': 1.0000001,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '19542',
              'metadata': {'text': 'I smoke weed. Is it really harmful? Should '
                                   'I stop it?'},
              'score': 0.82423276,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '37052',
              'metadata': {'text': 'Does smoking weed is bad for health?'},
              'score': 0.8134777,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '36290',
              'metadata': {'text': 'Should I stop smoking weed?'},
              'score': 0.74605834,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '13068',
              'm

In [77]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

1.0: Is it safe to smoke weed?
0.82: I smoke weed. Is it really harmful? Should I stop it?
0.81: Does smoking weed is bad for health?
0.75: Should I stop smoking weed?
0.73: Is it safe to smoke weed while on Prozac?
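
Conceptually, the query above ranks stored vectors by cosine similarity to the query vector and returns the top 5. The sketch below does this with an exhaustive scan over a tiny in-memory list of `(id, vector, metadata)` records. It is an illustration only, not how Pinecone implements search; `cosine` and `top_k` are hypothetical helper names, and the 2-d vectors and texts are made-up toy data:

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, records, k=5):
    """Score every (id, vector, metadata) record and return the k best matches."""
    scored = (
        {'id': _id, 'score': cosine(query_vec, vec), 'metadata': meta}
        for _id, vec, meta in records
    )
    return heapq.nlargest(k, scored, key=lambda match: match['score'])


# toy records in the same (id, vector, metadata) layout used for upserts
records = [
    ('0', [1.0, 0.0], {'text': 'first toy question'}),
    ('1', [0.0, 1.0], {'text': 'second toy question'}),
    ('2', [0.7, 0.7], {'text': 'third toy question'}),
]

matches = top_k([1.0, 0.1], records, k=2)
for match in matches:
    print(f"{round(match['score'], 2)}: {match['metadata']['text']}")
```

An exhaustive scan like this costs one similarity computation per stored vector for every query, which is the main reason to hand the search to a dedicated vector index at the scale of 80000 records.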


In [75]:
questions

['Am I eligible for defence quota if my father was in the navy?',
 "How do we know that we're not living in a computer simulation?",
 'How were the IP addresses starting with 172 established?',
 'Why does Quora seemingly always tell me that my questions need improvement when they are clear and concise questions?',
 'Why do the Olympic winners bite their medals?',
 'Which movie is better in your opinion, Godfather 1 or 2?',
 'Is it safe to buy a laptop-online from Infibeam?',
 'Can a minor sue or be sued?',
 "My girlfriend's ex-boyfriend is trying to get her back, but she loves me and wants to stay with me. What should I do get rid of him?",
 'Which college should I try for, to get a job in discovery channel? And which course should be opt for?',
 'Is it necessary to transfer documents of bike from one state to another?',
 'Is Google Nexus better than the other Android phones?',
 'What does it mean when people say that psychopaths have a disregard for social norms? What does "disregard 

In [78]:
# delete the index when finished to free up resources
pinecone.delete_index(index_name)