<a href="https://colab.research.google.com/github/nicks165/VectorDatabases/blob/main/Pinecone_evaluation_cohere_wiki_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Cohere example

In [None]:
!pip install -U cohere pinecone-client datasets



In [None]:


from datasets import load_dataset
import cohere

# Load all documents and their precomputed embeddings.
limit = -1  # keep -1 to load everything; set a positive number to cap the document count

max_docs_loaded = 0
docs_stream = load_dataset("Cohere/wikipedia-22-12-zh-embeddings", split="train")
docs = []
doc_embeddings = []

for doc in docs_stream:
    docs.append(doc)
    doc_embeddings.append(doc['emb'])
    max_docs_loaded += 1
    if limit > 0 and max_docs_loaded == limit:
        break



In [None]:
docs_stream.save_to_disk("cohere dataset")

Saving the dataset (0/17 shards):   0%|          | 0/2210013 [00:00<?, ? examples/s]

In [None]:
import numpy as np

shape = np.array(doc_embeddings).shape
print(shape)

(2210013, 768)
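
The shape above implies a large in-memory array. As a rough sketch, the cell below estimates the footprint of a dense array of that shape. It assumes the embeddings are plain Python floats, in which case `np.array` defaults to `float64`; the helper name `footprint_gib` is hypothetical and used only for illustration.

```python
import numpy as np

# Shape printed above: 2,210,013 vectors with 768 dimensions each.
n_vectors, dim = 2210013, 768

def footprint_gib(dtype):
    """Return the size in GiB of a dense (n_vectors, dim) array of `dtype`."""
    return n_vectors * dim * np.dtype(dtype).itemsize / 2**30

size_f64 = footprint_gib(np.float64)  # NumPy's default for Python floats
size_f32 = footprint_gib(np.float32)  # half the size, if reduced precision is acceptable
print(f"float64: {size_f64:.2f} GiB, float32: {size_f32:.2f} GiB")
```

If memory is tight, passing `dtype=np.float32` to `np.array` is one option, at the cost of some precision.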


In [None]:
import pinecone

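# Replace the placeholder with your own Pinecone API key and environment;
# avoid committing a real key to a shared notebook.
pinecone.init(api_key="YOUR_PINECONE_API_KEY",
              environment="us-east-1-aws")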


In [None]:
index_name = 'cohere-wiki-2m'
dimension = shape[1]
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=dimension,
        metric='cosine'
    )

# now connect to the index
index = pinecone.Index(index_name)

In [None]:
import time
batch_size = 100

ids = [str(i) for i in range(shape[0])]

# create list of metadata dictionaries
meta = [{'text': doc['text']} for doc in docs]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, doc_embeddings, meta))

start_time = time.time()
for i in range(0, shape[0], batch_size):
    i_end = min(i+batch_size, shape[0])
    index.upsert(vectors=to_upsert[i:i_end])

    # report elapsed time at selected milestones
    if i_end in (shape[0], batch_size, 1000, 10000, 100000, 500000, 1000000, 2000000):
        print(" For {0} entries, time taken for inserts = {1} ".format(i_end, time.time() - start_time))


# let's view the index statistics
print("==========================================")
print(index.describe_index_stats())

 For 100 entries, time taken for inserts = 2.825125217437744 
 For 1000 entries, time taken for inserts = 6.655998468399048 
 For 10000 entries, time taken for inserts = 44.71555757522583 
 For 100000 entries, time taken for inserts = 407.31243991851807 
 For 500000 entries, time taken for inserts = 2096.0875356197357 
 For 1000000 entries, time taken for inserts = 4265.846411705017 
 For 2000000 entries, time taken for inserts = 9015.962130069733 
 For 2210013 entries, time taken for inserts = 10135.262546300888 
{'dimension': 768,
 'index_fullness': 0.9,
 'namespaces': {'': {'vector_count': 2210013}},
 'total_vector_count': 2210013}
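
To make the insert log easier to read, the sketch below converts the cumulative timings into throughput. The `(entries, seconds)` pairs are copied from the output above and rounded to two decimals; the function name `throughput` is hypothetical.

```python
# (entries inserted, cumulative seconds), copied from the log above and rounded.
timings = [
    (100, 2.83), (1000, 6.66), (10000, 44.72), (100000, 407.31),
    (500000, 2096.09), (1000000, 4265.85), (2000000, 9015.96), (2210013, 10135.26),
]

def throughput(rows):
    """Return (entries, cumulative vectors per second) for each logged milestone."""
    return [(n, n / seconds) for n, seconds in rows]

rates = throughput(timings)
for n, rate in rates:
    print(f"{n:>9,} vectors: {rate:7.1f} vectors/s")
```

On these numbers the full load of 2,210,013 vectors averages a little under 220 vectors per second with `batch_size = 100`, which works out to roughly 2.8 hours in total.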


In [None]:
co = cohere.Client("YOUR_COHERE_API_KEY")  # Add your Cohere API key from www.cohere.com

# Only query1 is embedded below; query2 and query3 are spare slots for further tests.
query1 = "What was the cause of the major recession in the early 20th century?"
query2 = "Where is Mount Everest?"
query3 = ""

# create the query embedding
xq = co.embed(texts=[query1], model='multilingual-22-12').embeddings

query_start_time = time.time()

# query, returning the top 5 most similar results
res = index.query(xq, top_k=5, include_metadata=True)

print(" For 1 query, time taken for search = {0} ".format(time.time() - query_start_time))

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

 For 1 query, time taken for search = 0.3019850254058838 
0.88: The end of the crisis coincided with the beginning of the great wave of immigration to the United States, which lasted until the early 1920s.
0.88: 隨後的全球經濟衰退導致國際貿易量急劇下降、失業率上升和商品價格暴跌。一些經濟學家預測，復甦可能要到2011年才會出現，本次衰退將是自1930年代大蕭條以來最嚴重的一次。經濟學家保罗·克鲁格曼（Paul Krugman）曾經評論說，這似乎是「第二次大蕭條」的開始。
0.88: 原俄羅斯帝國20世紀初在主要的工業國中地位並不很高。歷經第一次世界大戰和蘇俄內戰的破壞，1920年相比戰前的1913年，蘇俄的工業產值下降了86%。到1926年蘇聯工業產值恢復並超過了1913年的水平，但由於人口增長，事實上人均工業產值還出現了下降。
0.87: 10.Elgin, C. and S. Tumen, (2012), Can sustained economic growth and declining population coexist? Economic Modelling, 29, pp.1899–1908.
0.87: The depression ended in the spring of 1879, but tension between workers and the leaders of banking and manufacturing interests lingered on.
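
Because the index was created with `metric='cosine'`, the scores printed above are cosine similarities between the query embedding and each stored vector. The sketch below shows the same ranking done by brute force in NumPy on toy data. It is an illustration of the metric, not of how Pinecone searches internally; `cosine_top_k` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_top_k(query, vectors, k=5):
    """Return (indices, scores) of the k rows of `vectors` most similar to `query`."""
    query = np.asarray(query, dtype=np.float32)
    vectors = np.asarray(vectors, dtype=np.float32)
    # Normalise to unit length so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(-scores)[:k]  # indices sorted by descending score
    return top, scores[top]

# Toy 2-D vectors standing in for real 768-dimensional embeddings.
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_idx, toy_scores = cosine_top_k([1.0, 0.0], toy_vectors, k=2)
print(toy_idx, toy_scores)
```

A brute-force scan like this costs time proportional to the number of stored vectors per query, so it is mainly useful for spot-checking a small sample of results.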
