<a href="https://colab.research.google.com/github/pragneshrana/MLFlow/blob/main/LLM/LangChainRetrivalPineCone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install -qU \
  pinecone-client==3.0.0 \
  pinecone-datasets==0.7.0 \
  langchain-pinecone==0.0.3 \
  langchain-openai==0.0.7 \
  langchain==0.1.9

In [2]:
import os
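# read the key from the environment instead of hard-coding it in the notebook
pinecone_api_key = os.environ.get('PINECONE_API_KEY', '')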
# openai_api_key = os.environ.get('OPENAI_API_KEY')

### Loading sample dataset

In [3]:
import pinecone_datasets

dataset = pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K')
len(dataset)

100000

In [4]:
# drop the original metadata column; the 'blob' column is renamed and used as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# keep only the first 300 rows of the dataset
dataset.documents.drop(dataset.documents.index[300:], inplace=True)

### Setting up Pinecone

In [5]:
from pinecone import Pinecone, ServerlessSpec, PodSpec
import time

use_serverless = False

# configure client
pc = Pinecone(api_key=pinecone_api_key)

if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
else:
    # if not using a starter index, you should specify a pod_type too
    spec = PodSpec(environment="gcp-starter")


In [6]:
# check for and delete index if already exists
index_name = 'testing'
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# create a new index
pc.create_index(
    index_name,
    dimension=1536,  # dimensionality of text-embedding-ada-002
    metric='dotproduct',
    spec=spec
)

# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
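
The index above is created with `metric='dotproduct'`, so matches are ranked by the dot product between the query vector and each stored vector. As a rough illustration only (this is not how Pinecone implements search internally), the ranking can be sketched in plain Python. The toy vectors and the `top_k` helper are hypothetical and exist only for this example.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs with the highest dot-product score."""
    scored = [(vec_id, dot(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional vectors; the real index uses 1536 dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [-1.0, 0.0]}
ranking = top_k([1.0, 0.0], toy_vectors, k=2)
print(ranking)
```

A higher score means a closer match under this metric, so vectors pointing away from the query (such as `"c"` here) end up at the bottom of the ranking.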

In [7]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

### Adding embeddings to Pinecone

In [8]:
# upsert the documents into the index in small batches
for batch in dataset.iter_documents(batch_size=4):
    index.upsert(batch)
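
The loop above sends 4 records per `upsert` call, so 300 rows take 75 requests. A larger batch size would usually mean fewer round trips. The batching idea itself can be sketched without any Pinecone dependency; the `batched` helper below is a hypothetical stand-in for illustration and is not the implementation behind `iter_documents`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Ten items in batches of four: the last batch is shorter.
batches = list(batched(range(10), 4))
print(batches)
```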

In [9]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
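
The stats above still report zero vectors right after the upsert. One plausible explanation is that index statistics can lag behind writes, so it may help to poll until the expected count appears. The sketch below is a generic polling helper under that assumption; `wait_for_count` and its parameters are hypothetical names, and the stub stands in for a real stats call.

```python
import time
from typing import Callable


def wait_for_count(
    get_count: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> int:
    """Poll get_count() until it reaches expected or the timeout passes.

    Returns the last observed count, which may be below expected on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        count = get_count()
        if count >= expected or time.monotonic() >= deadline:
            return count
        time.sleep(interval_s)


# Stub that mimics stats catching up after two stale reads.
_fake_counts = iter([0, 0, 300])
final_count = wait_for_count(lambda: next(_fake_counts), expected=300, interval_s=0.0)
print(final_count)
```

In this notebook the callable could be something like `lambda: index.describe_index_stats()['total_vector_count']` with `expected=len(dataset.documents)`, assuming the stats response supports item access as the printed output suggests.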

### Embeddings

https://python.langchain.com/docs/modules/data_connection/text_embedding/


https://bakshiharsh55.medium.com/text-embedding-models-in-langchain-887f1873c7ac

In [19]:
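# FakeEmbeddings returns random vectors, so search results will not be semantically meaningful;
# for real retrieval use langchain_openai.OpenAIEmbeddings(model='text-embedding-ada-002')
from langchain_community.embeddings import FakeEmbeddings
embeddings = FakeEmbeddings(size=1536)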

In [20]:
from langchain_pinecone import PineconeVectorStore

text_field = "text"

vectorstore = PineconeVectorStore(
    index, embeddings, text_field
)
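
The `text_field` argument names the metadata key that holds each record's text; the vector store uses that key to build the page content of the documents it returns. A rough sketch of that mapping, assuming matches arrive as dictionaries with a `metadata` entry, is shown below. The `matches_to_docs` helper and the toy matches (including the `title` field) are hypothetical and only approximate what the library does.

```python
from typing import Any, Dict, List, Tuple


def matches_to_docs(
    matches: List[Dict[str, Any]], text_key: str
) -> List[Tuple[str, Dict[str, Any]]]:
    """Split each match's metadata into (text, remaining metadata).

    Matches whose metadata lacks text_key are skipped.
    """
    docs = []
    for match in matches:
        metadata = dict(match.get("metadata", {}))  # copy so the input is not mutated
        if text_key not in metadata:
            continue
        text = metadata.pop(text_key)
        docs.append((text, metadata))
    return docs


# Hypothetical query results for illustration only.
toy_matches = [
    {"id": "1", "score": 0.9, "metadata": {"text": "An atom is small.", "title": "Atom"}},
    {"id": "2", "score": 0.8, "metadata": {"title": "No text here"}},
]
docs = matches_to_docs(toy_matches, "text")
print(docs)
```

If `text_field` does not match a key that actually exists in the stored metadata, a search can come back with fewer documents than requested, which is worth checking when results look empty.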

In [22]:
query = "What is a particle?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content="Further into the 20th century, physicists went deeper into the mysteries of the atom. Using particle accelerators they discovered that protons and neutrons were actually made of other particles, called quarks.\n\nThe most accurate model so far comes from the Schrödinger equation. Schrödinger realized that the electrons exist in a cloud around the nucleus, called the electron cloud. In the electron cloud, it is impossible to know exactly where electrons are. The Schrödinger equation is used to find out where an electron is likely to be. This area is called the electron's orbital.\n\nStructure and parts\n\nParts \nThe complex atom is made up of three main particles; the proton, the neutron and the electron. The isotope of Hydrogen Hydrogen-1 has no neutrons, just the one proton and one electron. Protons have a positive electric charge and electrons have a negative charge.  A positive hydrogen ion has no electrons, just the one proton.  These two examples are the o