<a href="https://colab.research.google.com/github/mzohaibnasir/GenAI/blob/main/06_vector_database.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# VECTOR DATABASE

A vector database is a database for storing high-dimensional vectors, such as word embeddings and image embeddings.
It stores pieces of information as vectors and indexes them so that related items sit close together, which enables similarity searches and supports the AI applications built on top of them.

# How do vector databases work?
Each vector in a vector database corresponds to an object or item, whether that is a word, an image, a video, a movie, a document, or any other piece of data. These vectors are likely to be lengthy and complex, expressing the location of each object along dozens or even hundreds of dimensions.

For example, a vector database of movies may locate movies along dimensions like running time, genre, year released, parental guidance rating, number of actors in common, number of viewers in common, and so on. If these vectors are created accurately, then similar movies are likely to end up clustered together in the vector database.
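As a minimal sketch of why similar items end up close together, the snippet below compares hand-made movie vectors with cosine similarity. The movie names and the three feature values are invented for illustration; real embeddings are usually learned by a model rather than written by hand.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: [action, romance, sci-fi]
movies = {
    "space_battle_1": [0.9, 0.1, 0.8],
    "space_battle_2": [0.8, 0.2, 0.9],
    "love_story": [0.1, 0.9, 0.0],
}

sim_sequel = cosine_similarity(movies["space_battle_1"], movies["space_battle_2"])
sim_romance = cosine_similarity(movies["space_battle_1"], movies["love_story"])
print(f"sequel: {sim_sequel:.3f}, romance: {sim_romance:.3f}")
```

The two space movies score much closer to 1.0 than the space movie and the romance do, which is the sense in which similar items are "clustered together".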

# How are vector databases used?
Vector databases are used for:

1. Similarity and semantic searches: Vector databases allow applications to connect pertinent items together. Vectors that are clustered together are similar and likely relevant to each other. This can help users search for relevant information (e.g. an image search), but it also helps applications recommend similar products; suggest songs, movies, or shows; and suggest images or video.
2. Machine learning and deep learning: The ability to connect relevant items of information makes it possible to construct machine learning (and deep learning) models that can do complex cognitive tasks.
3. Large language models (LLMs) and generative AI: LLMs, like those on which ChatGPT and Bard are built, represent words, sentences, and ideas as vectors. Applications built on them commonly use vector databases to find text related to a query and supply it to the model as context.

To summarize: Vector databases work at scale, work quickly, and are more cost-effective than querying machine learning models without them.



# Embedding generation

## Non-DL (frequency-based and simple encodings)

1. Bag of words (BoW, document-term matrix)
2. TF-IDF
3. N-grams
4. One-hot encoding
5. Integer encoding

## Issues with non-DL methods

### For one-hot encoding and integer encoding

1. Sparse matrix (too many zeros), in the case of one-hot encoding
2. No context

### For BoW (document-term matrix), TF-IDF, and n-grams

1. Encodings are built from the vocabulary
2. Still no context
3. Frequency based
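The issues above can be seen in a small sketch. The helper names (`build_vocabulary`, `one_hot`, `bag_of_words`) and the two sentences are made up for illustration, and the tokenizer is just a lowercase whitespace split.

```python
def build_vocabulary(sentences):
    """Map each distinct word to an integer index (this is integer encoding)."""
    words = sorted({word for s in sentences for word in s.lower().split()})
    return {word: i for i, word in enumerate(words)}

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        vec[vocab[word]] += 1
    return vec

sentences = ["dog bites man", "man bites dog"]
vocab = build_vocabulary(sentences)
dog_vector = one_hot("dog", vocab)
bow_a = bag_of_words(sentences[0], vocab)
bow_b = bag_of_words(sentences[1], vocab)

print(vocab)
print(dog_vector)
print(bow_a == bow_b)
```

The one-hot vector has one non-zero entry per vocabulary-sized vector, so it gets sparser as the vocabulary grows. The two sentences mean different things but get identical bag-of-words vectors, which is what "no context" looks like in practice.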

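## With DL

1. word2vec
2. fastText
3. ELMo
4. BERT
5. GloVe (matrix factorization)

### Benefits

1. Dense vectors
2. Vectors capture context: word2vec, fastText, and GloVe learn one vector per word from the contexts it appears in, while ELMo and BERT produce a different vector for each occurrence of a word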

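## WORD2VEC

word2vec is based on features: each word, for example `king`, is described by a vector of learned features.

A shallow neural network is trained to predict a word from its neighbors (or the neighbors from the word), and the weights it learns for each word are that word's embedding vector.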
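A rough sketch of the lookup step: in a word2vec-style network the input is a one-hot vector, and multiplying it by the first weight matrix simply selects that word's row. The three-word vocabulary and the weight values below are invented; in a real model they are learned during training.

```python
vocab = {"king": 0, "queen": 1, "apple": 2}

# Hypothetical 3-word x 2-dimension weight matrix (learned in a real model).
W = [
    [0.8, 0.6],
    [0.7, 0.7],
    [-0.5, 0.1],
]

def one_hot(word):
    """One-hot input vector for the word."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec

def embed(word):
    """Multiply the one-hot row vector by W; this selects the word's row of W."""
    x = one_hot(word)
    dims = len(W[0])
    return [sum(x[i] * W[i][d] for i in range(len(W))) for d in range(dims)]

king_vector = embed("king")
print(king_vector)
```

Because the product just picks out a row, libraries store the matrix and look rows up by index instead of performing the multiplication.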

# Vector databases store embeddings

A vector database indexes and stores embeddings for faster retrieval and similarity search. Embeddings are used in:

1. Search
2. Clustering, where text strings are grouped by similarity
3. Recommendation, where related items are recommended
4. Classification
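To make "store, then search by similarity" concrete, here is a minimal in-memory sketch. `TinyVectorIndex` is a hypothetical class written for this note, not the API of Pinecone or any other product, and it scans every vector, whereas real vector databases typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math

class TinyVectorIndex:
    """Hypothetical brute-force index: stores vectors by id and ranks them by cosine similarity."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, item_id, vector):
        """Insert a vector, or overwrite the one already stored under item_id."""
        self._vectors[item_id] = list(vector)

    def query(self, vector, top_k=2):
        """Return the top_k (item_id, score) pairs, most similar first."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norms

        scored = [(item_id, cosine(vector, v)) for item_id, v in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

index = TinyVectorIndex()
index.upsert("doc-cats", [0.9, 0.1, 0.0])
index.upsert("doc-dogs", [0.8, 0.3, 0.1])
index.upsert("doc-tax", [0.0, 0.2, 0.9])

results = index.query([1.0, 0.2, 0.0], top_k=2)
print(results)
```

The query vector points in roughly the same direction as the two pet documents, so they are returned ahead of the unrelated one.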


#  Pinecone Vector DB

In [15]:
! pip install langchain
! pip install pinecone-client
! pip install openai
! pip install tiktoken
! pip install pypdf


In [16]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

We'll collect data from PDFs and convert it into embeddings.

## Prepare data

In [17]:
!mkdir -p pdfs


In [18]:
loader = PyPDFDirectoryLoader("pdfs")
loader

<langchain_community.document_loaders.pdf.PyPDFDirectoryLoader at 0x7eaf2d177910>

In [19]:
data = loader.load()
data



[Document(page_content='Online edition (c)\n2009 Cambridge UPAn\nIntroduction\nto\nInformation\nRetrieval\nDraft of April 1, 2009', metadata={'source': 'pdfs/irbookonlinereading.pdf', 'page': 0}),
 Document(page_content='Online edition (c)\n2009 Cambridge UP', metadata={'source': 'pdfs/irbookonlinereading.pdf', 'page': 1}),
 Document(page_content='Online edition (c)\n2009 Cambridge UPAn\nIntroduction\nto\nInformation\nRetrieval\nChristopher D. Manning\nPrabhakar Raghavan\nHinrich Schütze\nCambridge University Press\nCambridge, England', metadata={'source': 'pdfs/irbookonlinereading.pdf', 'page': 2}),
 Document(page_content='Online edition (c)\n2009 Cambridge UPDRAFT!\nDONOT DISTRIBUTE WITHOUT PRIORPERMISSION\n©2009 Cambridge University Press\nByChristopher D. Manning, Prabhakar Raghavan &Hinrich Sch ütze\nPrinted onApril 1,2009\nWebsite: http://www.informationretrieval.org/\nComments, corrections, andother feedback most welcome at:\ninformationretrieval@yahoogroups.com', metadata={'sou

## Chunking

Large language models can only read a limited amount of text at once, known as the context window. What happens when you present these models with a document that exceeds it? This is where a strategy known as "chunking" comes into play. Chunking involves dividing the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.

LangChain provides a range of chunking techniques to choose from. Among these, the RecursiveCharacterTextSplitter is the one generally recommended for generic text.

The RecursiveCharacterTextSplitter splits a large text into chunks no longer than a specified chunk size. It works through a list of separator characters in order, moving on to the next separator only when a piece is still too large. The default separators are ["\n\n", "\n", " ", ""].
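The idea of trying separators in order can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` below never merges small neighboring pieces back together and ignores overlap, both of which the real splitter handles.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with finer separators on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

sample = "First paragraph here.\n\nSecond one\nwith two lines."
chunks = recursive_split(sample, chunk_size=12)
print(chunks)
```

Paragraph breaks are tried first, then line breaks, then spaces, so a piece is only broken at a finer level when the coarser split still leaves it over the limit.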


`How the text is split: by list of characters`
____________________
`How the chunk size is measured: by number of characters`
______
`trying to keep paragraphs, then sentences, then words together`

________

The important parameters here are `chunk_size` and `chunk_overlap`. `chunk_size` controls the maximum size (in number of characters) of each chunk. `chunk_overlap` specifies how much overlap there should be between neighboring chunks, which helps keep context that would otherwise be lost when a split falls in the middle of a thought.
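To show what overlap means in isolation, here is a sketch of a plain fixed-window chunker. The function name `chunk_with_overlap` is hypothetical, and this is deliberately simpler than RecursiveCharacterTextSplitter: it cuts at fixed character positions and ignores separators entirely.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last character of the previous one, so text near a boundary appears in both neighbors.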

In [20]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # maximum number of characters per chunk
)

In [21]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H."""


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10,    # at most 10 characters per chunk
    chunk_overlap=1,  # up to 1 character shared between neighboring chunks
)
texts = text_splitter.split_text(text)

print(len(texts))
print(texts)

18
['Hi.', "I'm", 'Harrison.', 'How? Are?', 'You?', 'Okay then', 'f f f f.', 'This is a', 'weird', 'text to', 'write,', 'but gotta', 'test the', 'splitting', 'gggg', 'some how.', 'Bye!', '-H.']
