## Vector stores and retrievers
* Documents
* Vector stores
* Retrievers

In [None]:
!pip install langchain langchain-pinecone langchain-community langchain-text-splitters sentence-transformers python-dotenv

In [7]:
import os
from dotenv import load_dotenv

load_dotenv()

True

## Vector stores

A vector store holds text as embedding vectors. Given a query, we embed it as a vector of the same dimension as the stored vectors and use a vector similarity metric to identify related data in the store.

A LangChain `VectorStore` object provides methods for adding text and `Document` objects to the store and for querying them with various similarity metrics.
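
To make the idea of a similarity metric concrete, here is a minimal sketch of cosine similarity ranking in plain Python. The three-dimensional vectors and the names `toy_store` and `toy_query` are made up for illustration; a real embedding model (such as the one used below) produces vectors with hundreds of dimensions, and a vector store may use a different metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-written "embeddings", not produced by any model.
toy_store = {
    "doc about engineering": [0.9, 0.1, 0.0],
    "doc about geography": [0.1, 0.9, 0.1],
    "doc about cooking": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

# Rank stored items from most to least similar to the query vector.
ranked = sorted(
    toy_store,
    key=lambda name: cosine_similarity(toy_query, toy_store[name]),
    reverse=True,
)
print(ranked)
```

The query vector points mostly along the first axis, so the engineering document ranks first.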

In [11]:
from langchain_pinecone import PineconeVectorStore

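# load_dotenv() above has already loaded PINECONE_API_KEY from .env into the
# environment; fail early with a clear message if it is missing.
if not os.getenv("PINECONE_API_KEY"):
    raise ValueError("PINECONE_API_KEY is not set; add it to your .env file")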

In [9]:
index_name = "langchain-vectorstore"

In [12]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [13]:
def download_embedding_model():
    embeddings = HuggingFaceEmbeddings(
        model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2",
    )
    return embeddings

In [14]:
embedding = download_embedding_model()



In [15]:
embedding

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), model_name='sentence-transformers/paraphrase-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

The `from_documents` method accepts a list of LangChain `Document` objects, which can be written by hand or produced by a text splitter such as `CharacterTextSplitter`. The `from_texts` method accepts a list of plain strings.

The documents below are prepared by hand, so no text splitter is needed for them. To load from a `.txt` file instead, use `TextLoader` and `CharacterTextSplitter`, as in the text-file cell further down.

In [18]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Manoj Baniya is a computer engineering student.",
        metadata={"source": "manoj-baniya-doc"},
    ),
    Document(
        page_content="Manoj Baniya studies at Tribhuvan University Institute of Engineering.",
        metadata={"source": "manoj-baniya-doc"},
    ),
    Document(
        page_content="Manoj Baniya is from Jhapa Shivasatakshi-2.",
        metadata={"source": "manoj-baniya-doc"},
    ),
    Document(
        page_content="Eastern Regional Campus, Dharan is one of the constituent campuses of Tribhuvan University Institute of Engineering.",
        metadata={"source": "manoj-baniya-doc"},
    ),
    Document(
        page_content="Manoj Baniya is in his final year of computer engineering.",
        metadata={"source": "manoj-baniya-doc"},
    ),
]

In [25]:
# From text files
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("kaparthy_info.txt")
# Use a separate name so the hand-written `documents` list above is not overwritten
raw_documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=5)
docs = text_splitter.split_documents(raw_documents)
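
As a rough illustration of what a character-based splitter does, the sketch below splits text on a separator and greedily merges the pieces into chunks of at most `chunk_size` characters. It is a simplification and not LangChain's implementation: the helper name `split_on_separator` is hypothetical, overlap between chunks is ignored, and a single piece longer than `chunk_size` is kept whole.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters.

    Simplified sketch: no chunk overlap, and an oversized piece is not cut further.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # the piece still fits (or starts a new chunk)
        else:
            chunks.append(current)  # close the full chunk and start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_on_separator(sample, chunk_size=40)
print(chunks)
```

With a limit of 40 characters, the first two paragraphs fit in one chunk and the third starts a new one.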

In [28]:
docs[0].page_content

'Andrej Karpathy is a prominent figure in the field of artificial intelligence and deep learning. Born in Slovakia and raised in Canada, Karpathy is renowned for his contributions to computer vision, neural networks, and autonomous systems. He completed his undergraduate studies at the University of Toronto, where he worked with Geoffrey Hinton, one of the pioneers of deep learning. Karpathy then pursued a Ph.D. at Stanford University under the supervision of Fei-Fei Li, focusing on computer vision and machine learning.'

In [30]:
vectorstore_kaparthy = PineconeVectorStore.from_documents(
    documents=docs,
    index_name="kaparthy-info",
    embedding=embedding
)

In [34]:
vectorstore_kaparthy.similarity_search("Who is Andrej Kaparthy")[0].page_content

'Andrej Karpathy is a prominent figure in the field of artificial intelligence and deep learning. Born in Slovakia and raised in Canada, Karpathy is renowned for his contributions to computer vision, neural networks, and autonomous systems. He completed his undergraduate studies at the University of Toronto, where he worked with Geoffrey Hinton, one of the pioneers of deep learning. Karpathy then pursued a Ph.D. at Stanford University under the supervision of Fei-Fei Li, focusing on computer vision and machine learning.'

In [36]:
vectorstore_kaparthy.similarity_search("What is Computer Vision")[0].page_content

"In 2017, Karpathy joined Tesla as the Director of AI, where he played a pivotal role in developing the company's Autopilot system. Under his leadership, Tesla made significant advancements in computer vision and deep learning algorithms for autonomous driving. Karpathy's team focused on building end-to-end neural networks that process data from multiple cameras and sensors to enable real-time decision-making and navigation for Tesla's vehicles.\n\nKarpathy is also known for his work at OpenAI, an AI research lab dedicated to ensuring that artificial general intelligence (AGI) benefits all of humanity. As a founding member and former research scientist at OpenAI, Karpathy contributed to several groundbreaking projects, including the development of the GPT (Generative Pre-trained Transformer) models, which have set new benchmarks in natural language processing and understanding."

### From LangChain documents

In [19]:
vectorstore = PineconeVectorStore.from_documents(
    documents=documents,
    index_name=index_name,
    embedding=embedding
)

Once we have instantiated a `VectorStore` that contains documents, we can query it. It includes methods for querying:

* Synchronously and asynchronously
* By string query and vector
* With and without returning similarity scores
* By similarity and maximum marginal relevance
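
Maximum marginal relevance (MMR) trades relevance to the query against redundancy among the results already chosen. The sketch below is a plain-Python illustration under stated assumptions, not the code LangChain or Pinecone runs: the function `mmr` and the two-dimensional toy vectors are hypothetical, and the parameter name `lambda_mult` is borrowed from LangChain's MMR search methods (1.0 means pure relevance, lower values favor diversity).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if nothing is).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical 2-dimensional vectors chosen to show the effect.
toy_docs = [
    [1.0, 0.0],   # 0: close to the query
    [0.99, 0.1],  # 1: closest to the query, near-duplicate of document 0
    [0.6, 0.8],   # 2: less similar to the query, but points elsewhere
]
toy_query = [0.9, 0.3]

picked = mmr(toy_query, toy_docs, k=2, lambda_mult=0.5)
print(picked)
```

Here plain similarity ranking would return documents 1 and 0, which say nearly the same thing. With `lambda_mult=0.5`, the second pick becomes document 2 because document 0 is penalized for being almost identical to document 1.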

In [20]:
vectorstore.similarity_search("Manoj")

[Document(page_content='Manoj Baniya is from Jhapa Shivasatakshi-2.', metadata={'source': 'manoj-baniya-doc'}),
 Document(page_content='Manoj Baniya is a computer engineering student.', metadata={'source': 'manoj-baniya-doc'}),
 Document(page_content='Manoj Baniya is in his final year of computer engineering.', metadata={'source': 'manoj-baniya-doc'}),
 Document(page_content='Manoj Baniya studies at Tribhuvan University Institute of Engineering.', metadata={'source': 'manoj-baniya-doc'})]

In [21]:
# async query
await vectorstore.asimilarity_search("Eastern Regional Campus")

[Document(page_content='Eastern Regional Campus, Dharan is one of the constituent campuses of Tribhuvan University Institute of Engineering.', metadata={'source': 'manoj-baniya-doc'}),
 Document(page_content='Manoj Baniya studies at Tribhuvan University Institute of Engineering.', metadata={'source': 'manoj-baniya-doc'}),
 Document(page_content='Manoj Baniya is in his final year of computer engineering.', metadata={'source': 'manoj-baniya-doc'}),
 Document(page_content='Manoj Baniya is a computer engineering student.', metadata={'source': 'manoj-baniya-doc'})]
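
A bare `await` at the top level works in a notebook cell because Jupyter already runs an event loop. In a plain Python script the call has to be wrapped in a coroutine and started with `asyncio.run`. The sketch below shows that pattern with a stand-in object so it runs without a Pinecone index; `FakeStore` and its canned return value are hypothetical and only mimic the shape of `asimilarity_search`.

```python
import asyncio

class FakeStore:
    """Hypothetical stand-in for a vector store; not a LangChain class."""

    async def asimilarity_search(self, query):
        await asyncio.sleep(0)  # pretend to wait on network I/O
        return [f"result for {query!r}"]

async def main():
    store = FakeStore()
    # Issue two queries concurrently and wait for both to finish.
    return await asyncio.gather(
        store.asimilarity_search("Eastern Regional Campus"),
        store.asimilarity_search("Manoj"),
    )

results = asyncio.run(main())
print(results)
```

The async methods are mainly useful when several queries, or a query and other I/O, can be in flight at the same time.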