<a href="https://colab.research.google.com/github/mzohaibnasir/GenAI/blob/main/06_vector_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# VECTOR DATABASE

A vector database is a database for storing high-dimensional vectors, such as word embeddings and image embeddings.
It stores pieces of information as vectors, so related items end up close together in the vector space. This enables similarity searches and supports the construction of powerful AI applications.

# How do vector databases work?
Each vector in a vector database corresponds to an object or item, whether that is a word, an image, a video, a movie, a document, or any other piece of data. These vectors are likely to be lengthy and complex, expressing the location of each object along dozens or even hundreds of dimensions.

For example, a vector database of movies may locate movies along dimensions like running time, genre, year released, parental guidance rating, number of actors in common, number of viewers in common, and so on. If these vectors are created accurately, then similar movies are likely to end up clustered together in the vector database.
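
To make the clustering idea concrete, here is a minimal sketch that compares movies by cosine similarity. The movie names and the three feature dimensions are hypothetical, hand-made values chosen only for illustration; real embeddings have far more dimensions and are learned rather than written by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: [action, romance, sci-fi]
movies = {
    "space_battle": [0.9, 0.1, 0.8],
    "galaxy_war":   [0.8, 0.2, 0.9],
    "love_letters": [0.1, 0.9, 0.0],
}

sim_close = cosine_similarity(movies["space_battle"], movies["galaxy_war"])
sim_far = cosine_similarity(movies["space_battle"], movies["love_letters"])
print(f"space_battle vs galaxy_war:   {sim_close:.3f}")
print(f"space_battle vs love_letters: {sim_far:.3f}")
```

The two science-fiction action movies score much closer to 1 than the action/romance pair, which is the sense in which similar items "cluster together".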

# How are vector databases used?
Vector databases are used for:

1. Similarity and semantic searches: Vector databases allow applications to connect pertinent items together. Vectors that are clustered together are similar and likely relevant to each other. This helps users search for relevant information (e.g. an image search), and it also helps applications recommend similar products; suggest songs, movies, or shows; and suggest images or video.
2. Machine learning and deep learning: The ability to connect relevant items of information makes it possible to construct machine learning (and deep learning) models that can do complex cognitive tasks.
3. Large language models (LLMs) and generative AI: LLMs, like those on which ChatGPT and Bard are built, represent words, sentences, and ideas as vectors. Vector databases let an application retrieve the text most relevant to a query and pass it to the LLM as context, which helps the model answer questions about data it was not trained on.

To summarize: Vector databases work at scale, work quickly, and are more cost-effective than querying machine learning models without them.



# Embedding generation

## Non-DL (frequency-based)

1. Bag of words (BoW, document-term matrix)
2. TF-IDF
3. n-gram
4. One-hot encoding
5. Integer encoding

## Issues with non-DL methods

### For one-hot encoding & integer encoding

1. Sparse matrix (too many zeroes)
2. No context
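
Both issues can be seen in a small sketch. The helper names below (`build_vocab`, `integer_encode`, `one_hot_encode`) are made up for this illustration and do not come from any library.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def integer_encode(tokens, vocab):
    """Replace each token by its integer id."""
    return [vocab[t] for t in tokens]

def one_hot_encode(tokens, vocab):
    """Replace each token by a vector with a single 1 at the token's id."""
    vectors = []
    for t in tokens:
        vec = [0] * len(vocab)
        vec[vocab[t]] = 1
        vectors.append(vec)
    return vectors

tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
int_ids = integer_encode(tokens, vocab)
one_hot = one_hot_encode(tokens, vocab)

print(vocab)
print(int_ids)
print(one_hot[0])
```

Every one-hot vector is as long as the vocabulary and contains exactly one non-zero entry, and both occurrences of "the" get an identical encoding regardless of the words around them.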

### For BoW (document-term matrix), TF-IDF & n-gram

1. Encodings are built from the vocabulary
2. Still no context
3. Frequency-based
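
As a rough sketch of how these frequency-based encodings are computed, the example below builds bag-of-words and TF-IDF vectors by hand. The three documents are invented, and the formula is the plain textbook variant (term frequency times `log(N / df)`); libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
# The vocabulary fixes the meaning of each vector position.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens):
    """Raw count of every vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def idf(term):
    """Plain IDF: log(N / number of documents containing the term)."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """Relative term frequency weighted by IDF."""
    counts = Counter(tokens)
    return [(counts[term] / len(tokens)) * idf(term) for term in vocab]

bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(bow[0])
print([round(w, 3) for w in tfidf[0]])
```

In the first document "the" has the highest raw count, but TF-IDF ranks "cat" above it because "cat" occurs in only one document. Word order and meaning are still ignored.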

## With DL

1. word2vec
2. fastText
3. ELMo
4. BERT
5. GloVe (matrix factorization)

### Benefits

1. Dense vectors
2. Captures context (meaning is learned from surrounding words)

## WORD2VEC

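Word2vec is based on learned features, i.e. a word such as "king" is described by a set of feature values.

A shallow neural network is trained on pairs of words and their surrounding context, and the weights it learns give the embedding vector for each word.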
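
The sketch below shows only the data-preparation step of the skip-gram variant of word2vec: turning a sentence into (center word, context word) training pairs. It does not train a network. The function name `skipgram_pairs` and the sample sentence are assumptions made for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king rules the land".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A network trained to predict the context word from the center word ends up assigning similar vectors to words that appear in similar contexts.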

# Vector databases index and store embeddings for fast retrieval and similarity search

Embeddings are used for:

1. Search
2. Clustering: text strings are grouped by similarity
3. Recommendation: related items are recommended
4. Classification
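
To illustrate what "store and search embeddings" means, here is a toy in-memory store with brute-force cosine search. The class `TinyVectorStore`, the document ids, and the vectors are all hypothetical. Real vector databases such as Pinecone build approximate nearest-neighbor indexes instead of scanning every vector, so treat this as a conceptual sketch only.

```python
import math

class TinyVectorStore:
    """Toy in-memory store with brute-force cosine search (no real index)."""

    def __init__(self):
        self._items = []  # list of (item_id, vector)

    def add(self, item_id, vector):
        self._items.append((item_id, list(vector)))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query, k=2):
        """Return the k most similar items as (item_id, score), best first."""
        scored = [(self._cosine(query, vec), item_id) for item_id, vec in self._items]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]

store = TinyVectorStore()
store.add("doc_cats", [0.9, 0.1, 0.0])
store.add("doc_dogs", [0.8, 0.3, 0.1])
store.add("doc_cars", [0.0, 0.2, 0.9])

results = store.search([1.0, 0.0, 0.0], k=2)
print(results)
```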


# Pinecone Vector DB

In [45]:
! pip install langchain
! pip install pinecone-client
! pip install openai
! pip install tiktoken
! pip install pypdf
! pip install -U langchain_pinecone



In [46]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
# from langchain.vectorstores import Pinecone
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

In [47]:

from google.colab import userdata

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
PINECONE_API_KEY = userdata.get("PINECONE_API_KEY")


os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY


# OPENAIAPIKEY,PINECONEAPIKEY

We'll collect data from PDFs and convert it into embeddings.

## Prepare data

In [48]:
!mkdir -p pdfs  # -p avoids an error if the directory already exists


In [49]:
loader = PyPDFDirectoryLoader("pdfs")
loader

<langchain_community.document_loaders.pdf.PyPDFDirectoryLoader at 0x7c57b7a2bee0>

In [50]:
data = loader.load()
len(data), data[0], data[1]



(581,
 Document(page_content='Online edition (c)\n2009 Cambridge UPAn\nIntroduction\nto\nInformation\nRetrieval\nDraft of April 1, 2009', metadata={'source': 'pdfs/irbookonlinereading.pdf', 'page': 0}),
 Document(page_content='Online edition (c)\n2009 Cambridge UP', metadata={'source': 'pdfs/irbookonlinereading.pdf', 'page': 1}))

## Chunking: dividing data into chunks

LLMs and embedding models can only process a limited amount of text at once, known as the context window. What happens when you present these models with a document that exceeds it? This is where a strategy known as "chunking" comes into play. Chunking divides the document into smaller, more manageable sections that fit within the context window.

LangChain provides a range of chunking techniques. Among them, the RecursiveCharacterTextSplitter is the one recommended for generic text.

The RecursiveCharacterTextSplitter splits a large text based on a specified chunk size. It tries a list of separator characters in order, moving to the next separator only when a piece is still too large. The default separators are ["\n\n", "\n", " ", ""].


1. How the text is split: by a list of characters
2. How the chunk size is measured: by number of characters
3. Splitting priority: keep paragraphs together, then sentences, then words
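
The separator fallback can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function `recursive_split` is hypothetical, it ignores chunk overlap, and it strips surrounding whitespace, so its output will not match RecursiveCharacterTextSplitter exactly.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with the next one for oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

demo_chunks = recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7)
print(demo_chunks)
```

The first paragraph fits in one chunk, so it stays whole. The second is too long, so it is split on spaces and its words are merged back together up to the size limit.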

Important parameters to know here are `chunk_size` and `chunk_overlap`. `chunk_size` controls the maximum size (in number of characters) of the final chunks. `chunk_overlap` specifies how many characters neighbouring chunks may share. Overlap helps preserve context when a split falls in the middle of a sentence or idea.
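
The effect of the two parameters is easiest to see with a plain sliding window. The helper `chunk_with_overlap` below is a hypothetical, simplified sketch that cuts at fixed character positions; unlike RecursiveCharacterTextSplitter, it pays no attention to separators.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_with_overlap("abcdefghijkl", chunk_size=5, chunk_overlap=2)
print(chunks)
```

Each chunk starts with the last two characters of the previous one, so text near a boundary appears in both chunks.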

In [51]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_demo = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H."""


text_splitter_demo = RecursiveCharacterTextSplitter(
    chunk_size=10,    # maximum characters per chunk
    chunk_overlap=1,  # maximum characters shared between neighbouring chunks
)
texts_demo = text_splitter_demo.split_text(text_demo)

print(len(texts_demo))
print(texts_demo)

18
['Hi.', "I'm", 'Harrison.', 'How? Are?', 'You?', 'Okay then', 'f f f f.', 'This is a', 'weird', 'text to', 'write,', 'but gotta', 'test the', 'splitting', 'gggg', 'some how.', 'Bye!', '-H.']


In [52]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
)
text_splitter

<langchain_text_splitters.character.RecursiveCharacterTextSplitter at 0x7c57b7c21a50>

In [53]:
text_chunks = text_splitter.split_documents(data)
len(text_chunks), text_chunks[0]

(3038,
 Document(page_content='Online edition (c)\n2009 Cambridge UPAn\nIntroduction\nto\nInformation\nRetrieval\nDraft of April 1, 2009', metadata={'source': 'pdfs/irbookonlinereading.pdf', 'page': 0}))

In [54]:
print(text_chunks[0].page_content)

Online edition (c)
2009 Cambridge UPAn
Introduction
to
Information
Retrieval
Draft of April 1, 2009


In [55]:
print(text_chunks[30].page_content)

20.1.2 Features a crawler should provide 444
20.2 Crawling 444
20.2.1 Crawler architecture 445
20.2.2 DNS resolution 449
20.2.3 The URL frontier 451
20.3 Distributing indexes 454
20.4 Connectivity servers 455
20.5 References and further reading 458
21Link analysis 461
21.1 The Web as a graph 462
21.1.1 Anchor text and the web graph 462
21.2 PageRank 464
21.2.1 Markov chains 465
21.2.2 The PageRank computation 468
21.2.3 Topic-speciﬁc PageRank 471
21.3 Hubs and Authorities 474


## Create an OpenAI embeddings object

In [56]:
# from openai import OpenAI
# client = OpenAI(
#     api_key=OPENAIAPIKEY
# )

# client

In [57]:
# from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
# embedding

In [58]:
len(embedding.embed_query(
    "how are you"
)) # len of embedding vector

1536

## Import Pinecone

### We'll create an embedding for each text chunk


In [59]:
"""

Pinecone is a cloud-hosted service, so the index lives on Pinecone's servers
rather than in this notebook.


The PineconeVectorStore class provided by LangChain can be used to interact
with Pinecone indexes. It’s important to remember that you must have an
existing Pinecone index before you can create a PineconeVectorStore object.

"""




from langchain_pinecone import PineconeVectorStore
# vectorstore = PineconeVectorStore(
#     # pinecone_api_key= PINECONE_API_KEY,
#     index_name="testing",
#     embedding=embedding)

# vectorstore
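
Since the index must exist beforehand, one way to create it from code is sketched below. This assumes `pinecone-client` version 3 or later and a serverless index. The function name `ensure_index`, the cosine metric, and the cloud and region values are assumptions for illustration, so adjust them to your own Pinecone project. The index can also be created in the Pinecone web console.

```python
def ensure_index(api_key, index_name="testing", dimension=1536):
    """Create the Pinecone index if it does not exist yet (assumes pinecone-client v3+)."""
    # Imported lazily so that defining this function does not require the package.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key=api_key)
    if index_name not in pc.list_indexes().names():
        pc.create_index(
            name=index_name,
            dimension=dimension,  # must match the embedding length (1536 above)
            metric="cosine",      # assumed similarity metric
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # assumed cloud/region
        )
    return pc.Index(index_name)

# Usage (needs a valid API key and network access):
# index = ensure_index(PINECONE_API_KEY)
```

The `dimension` has to equal the length of the vectors produced by the embedding model, otherwise upserts into the index will be rejected.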

In [60]:
"""

The from_documents and from_texts methods of LangChain’s PineconeVectorStore
class add records to a Pinecone index and return a PineconeVectorStore object.

The from_documents method accepts a list of LangChain’s Document class objects,
which can be created using LangChain’s CharacterTextSplitter class. The
from_texts method accepts a list of strings. Similarly to above, you must
provide the name of an existing Pinecone index and an Embeddings object.

"""



index_name = 'testing'

# vectorstore_from_docs = PineconeVectorStore.from_documents(
#         text_chunks,
#         index_name=index_name,
#         embedding=embedding,
#     )
# vectorstore_from_docs

In [61]:
# vectorstore_from_texts = PineconeVectorStore.from_texts(
#         [t.page_content for t in text_chunks],
#         index_name=index_name,
#         embedding=embedding,
#     )
# vectorstore_from_texts

In [62]:
docsearch = PineconeVectorStore.from_texts(
        [t.page_content for t in text_chunks],
        index_name=index_name,
        embedding=embedding,
    )
docsearch  # you can also see all embeddings in the 'testing' index on pinecone.io

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x7c579587d810>

### Doing a similarity search

In [63]:
query = " what's best ML model?"

In [64]:
docs = docsearch.similarity_search(query)
docs

[Document(page_content='ear classiﬁcation that we have already looked at in Chapters 13–15provide\nmethods for choosing this line. Provided we can build a sufﬁc iently rich col-\nlection of training samples, we can thus altogether avoid ha nd-tuning score\nfunctions as in Section 7.2.3 (page 145). The bottleneck of course is the ability\nto maintain a suitably representative set of training examp les, whose rele-\nvance assessments must be made by experts.\n15.4.2 Result ranking by machine learning'),
 Document(page_content='ear classiﬁcation that we have already looked at in Chapters 13–15provide\nmethods for choosing this line. Provided we can build a sufﬁc iently rich col-\nlection of training samples, we can thus altogether avoid ha nd-tuning score\nfunctions as in Section 7.2.3 (page 145). The bottleneck of course is the ability\nto maintain a suitably representative set of training examp les, whose rele-\nvance assessments must be made by experts.\n15.4.2 Result ranking by machin

In [65]:
llm=OpenAI()

In [68]:
docsearch.as_retriever()

VectorStoreRetriever(tags=['PineconeVectorStore', 'OpenAIEmbeddings'], vectorstore=<langchain_pinecone.vectorstores.PineconeVectorStore object at 0x7c579587d810>)

In [69]:
qa = RetrievalQA.from_chain_type(  # builds the question-answering chain
    llm=llm,
    chain_type="stuff",  # "stuff" puts all retrieved chunks into a single prompt
    retriever=docsearch.as_retriever()   # docsearch is the vector store holding all embeddings
)
# qa

In [67]:
qa.run(query)

'\nBased on the context provided, it seems that the best machine learning model in this scenario would be the multinomial NB model. However, it is important to note that this may vary depending on the specific data and task at hand.'
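
Conceptually, the "stuff" chain type retrieves the most similar chunks and places all of them into one prompt for the LLM. The sketch below imitates that step with plain strings. The function `build_stuffed_prompt`, the prompt wording, and the sample chunks are invented for illustration and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate all retrieved texts into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks standing in for the retriever's output
retrieved = [
    "Naive Bayes is a simple probabilistic classifier.",
    "Support vector machines find a maximum-margin hyperplane.",
]
prompt = build_stuffed_prompt("Which classifiers are mentioned?", retrieved)
print(prompt)
```

Because every retrieved chunk goes into a single prompt, this approach only works while the combined text fits in the model's context window, which is one reason the documents were chunked earlier.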

#### Making a small PDF-based Q&A system

We'll ask questions about the PDF content.

In [71]:
while True:
  user_input = input("Input Prompt: ")
  if user_input == 'exit':
    print('Exiting')
    break  # leave the loop cleanly instead of raising SystemExit in the notebook
  if user_input == "":
    continue  # ignore empty input
  result = qa.invoke({
      'query': user_input
  })
  print(f"Answer: {result['result']}")

Input Prompt: exit
Exiting
