<a href="https://colab.research.google.com/github/mzohaibnasir/langchainPinecone-PDFBasedQnA/blob/main/PDFBasedQnA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# VECTOR DATABASE

A vector database is a database for storing high-dimensional vectors, such as word embeddings and image embeddings.
It stores pieces of information as vectors and places related items close together, which enables similarity searches and supports the construction of powerful AI models.

# How do vector databases work?
Each vector in a vector database corresponds to an object or item, whether that is a word, an image, a video, a movie, a document, or any other piece of data. These vectors are likely to be lengthy and complex, expressing the location of each object along dozens or even hundreds of dimensions.

For example, a vector database of movies may locate movies along dimensions like running time, genre, year released, parental guidance rating, number of actors in common, number of viewers in common, and so on. If these vectors are created accurately, then similar movies are likely to end up clustered together in the vector database.
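
To make the idea of "clustered together" concrete, here is a minimal sketch that compares toy movie vectors with cosine similarity. The movie names, the three dimensions, and all the numbers are invented for illustration; a real system would use learned embeddings with many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made vectors: [action, romance, sci-fi]
movies = {
    "space_battle": [0.9, 0.1, 0.8],
    "galaxy_war":   [0.8, 0.2, 0.9],
    "love_letters": [0.1, 0.9, 0.0],
}

sim_close = cosine_similarity(movies["space_battle"], movies["galaxy_war"])
sim_far = cosine_similarity(movies["space_battle"], movies["love_letters"])

print(f"space_battle vs galaxy_war:   {sim_close:.3f}")
print(f"space_battle vs love_letters: {sim_far:.3f}")
```

The two science-fiction action movies score much closer to 1 than the action/romance pair, which is what "similar movies end up near each other" means in practice.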

# How are vector databases used?
1. Similarity and semantic searches: Vector databases allow applications to connect pertinent items together. Vectors that are clustered together are similar and likely relevant to each other. This can help users search for relevant information (e.g. an image search), but it also helps applications:
   - Recommend similar products
   - Suggest songs, movies, or shows
   - Suggest images or video
2. Machine learning and deep learning: The ability to connect relevant items of information makes it possible to construct machine learning (and deep learning) models that can do complex cognitive tasks.
3. Large language models (LLMs) and generative AI: LLMs, like those on which ChatGPT and Bard are built, rely on the contextual analysis of text made possible by vector databases. By associating words, sentences, and ideas with each other, LLMs can understand natural human language and even generate text.

To summarize: Vector databases work at scale, work quickly, and are more cost-effective than querying machine learning models without them.
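
The recommendation use case can be sketched as a nearest-neighbour lookup. The class below is a toy, brute-force stand-in written for this notebook (`TinyVectorStore` and the song vectors are hypothetical); real vector databases use approximate indexes so they do not have to scan every vector.

```python
import heapq
import math

class TinyVectorStore:
    """Toy in-memory store: brute-force cosine search, no real index."""

    def __init__(self):
        self._items = []  # list of (item_id, unit-length vector)

    @staticmethod
    def _normalize(vec):
        # Assumes a non-zero vector; a zero vector would divide by zero.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, item_id, vec):
        self._items.append((item_id, self._normalize(vec)))

    def search(self, query_vec, k=2):
        """Return the k most similar items as (item_id, cosine score)."""
        q = self._normalize(query_vec)
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in self._items
        )
        return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]

store = TinyVectorStore()
store.add("rock_song", [0.9, 0.1])
store.add("metal_song", [0.8, 0.3])
store.add("lullaby", [0.1, 0.9])

# "Recommend" the two items closest to what the user just listened to.
results = store.search([1.0, 0.2], k=2)
print(results)
```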



# Embedding generation

## Non-DL (frequency based)

1. BoW (docmat)
2. TF-IDF
3. n-gram
4. One-hot encoding
5. Integer encoding

## Issues with non-DL

### For one-hot encoding & integer encoding

1. Sparse matrix (too many zeroes), especially with one-hot encoding
2. No context

### For BoW (docmat), TF-IDF & n-gram

1. The encoding is built from the vocabulary
2. Still no context
3. Frequency based
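
A small sketch of the two problems listed above, using only the standard library. The three example sentences are made up; `one_hot` and `bag_of_words` are illustrative helpers, not library functions.

```python
from collections import Counter

docs = ["dog bites man", "man bites dog", "dog eats food"]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})

def one_hot(word):
    """One slot per vocabulary word; exactly one slot is 1, the rest are 0."""
    return [1 if word == v else 0 for v in vocab]

def bag_of_words(doc):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[v] for v in vocab]

bow_vectors = [bag_of_words(d) for d in docs]

print(vocab)
print(one_hot("dog"))
for doc, vec in zip(docs, bow_vectors):
    print(f"{doc!r:18} -> {vec}")
```

The one-hot vector is mostly zeroes even with a five-word vocabulary, and "dog bites man" and "man bites dog" get the same bag-of-words vector although they mean different things.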

## With DL

1. word2vec
2. fastText
3. ELMo
4. BERT
5. GloVe (matrix factorization)

### Benefits

1. Dense vectors
2. Context-aware

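## WORD2VEC

Word2vec describes each word by a set of learned features; informally, a word such as "king" can be thought of as scoring high on features like royalty.

A neural network is trained on text, and the weights it learns for each word become that word's embedding vector.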
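
In practice the features are not hand-picked. In the skip-gram variant of word2vec, a shallow network is trained to predict the words that appear near a given word. The sketch below only builds those (center, context) training pairs; the sentence, the `skipgram_pairs` helper, and the window size are illustrative, and the network training itself is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the king rules the kingdom".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Words that occur in similar contexts produce similar training pairs, which is why they end up with similar embedding vectors.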

# Vector databases store embeddings

A vector database indexes and stores embeddings for faster retrieval and similarity search. Embeddings are used for:

1. Search
2. Clustering, where text strings are grouped by similarity
3. Recommendation, where related items are recommended
4. Classification


#  Pinecone Vector DB

In [108]:
! pip install langchain
! pip install pinecone-client
! pip install openai
! pip install tiktoken
! pip install pypdf
! pip install -U langchain_pinecone



In [109]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
# from langchain.vectorstores import Pinecone
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os

In [110]:

from google.colab import userdata

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
PINECONE_API_KEY = userdata.get("PINECONE_API_KEY")


os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY


# OPENAIAPIKEY,PINECONEAPIKEY

We'll collect data from PDFs and convert it into embeddings.

## Prepare data

In [111]:
!mkdir pdfs

mkdir: cannot create directory ‘pdfs’: File exists


In [115]:
loader = PyPDFDirectoryLoader("pdfs")
loader

<langchain_community.document_loaders.pdf.PyPDFDirectoryLoader at 0x7c57b7064df0>

In [116]:
data = loader.load()
len(data), data[0], data[1]



(504,
 Document(page_content='MANNINGFrançois CholletSECOND EDITION', metadata={'source': 'pdfs/Deep Learning with Python.pdf', 'page': 0}),
 Document(page_content='Deep Learning with Python', metadata={'source': 'pdfs/Deep Learning with Python.pdf', 'page': 1}))

## Chunking: dividing data into chunks

What happens when you present an LLM with a document that exceeds its context window? This is where a strategy known as "chunking" comes into play. Chunking involves dividing the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.

LangChain provides a range of chunking techniques to choose from. Among them, the RecursiveCharacterTextSplitter is the one recommended for generic text.

The RecursiveCharacterTextSplitter takes a large text and splits it based on a specified chunk size. It does this by trying a list of separator characters in order, moving on to the next separator whenever a piece is still too large. The default separators are ["\n\n", "\n", " ", ""].


`How the text is split: by list of characters`
____________________
`How the chunk size is measured: by number of characters`
______
`trying to keep paragraphs, then sentences, then words`

________

The important parameters here are `chunk_size` and `chunk_overlap`. `chunk_size` controls the maximum size (in number of characters) of the final documents. `chunk_overlap` specifies how many characters consecutive chunks share. Overlap helps keep context that would otherwise be cut at a chunk boundary.
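
To see what the size and overlap parameters do in isolation, here is a deliberately simplified fixed-width splitter. It is not LangChain's algorithm (it ignores separators entirely), and `chunk_text` is a hypothetical helper written for this illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one. The recursive splitter adds one more idea on top of this: it prefers to cut at paragraph, line, or word boundaries instead of at arbitrary character positions.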

In [117]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_demo = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H."""


text_splitter_demo = RecursiveCharacterTextSplitter(
    chunk_size=10,    # at most 10 characters per chunk
    chunk_overlap=1,  # up to 1 character shared between neighbouring chunks
)
texts_demo = text_splitter_demo.split_text(text_demo)

print(len(texts_demo))
print(texts_demo)

18
['Hi.', "I'm", 'Harrison.', 'How? Are?', 'You?', 'Okay then', 'f f f f.', 'This is a', 'weird', 'text to', 'write,', 'but gotta', 'test the', 'splitting', 'gggg', 'some how.', 'Bye!', '-H.']


In [118]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # at most 500 characters per chunk
    chunk_overlap=20,  # up to 20 characters shared between neighbouring chunks
)
text_splitter

<langchain_text_splitters.character.RecursiveCharacterTextSplitter at 0x7c57b7067a60>

In [119]:
text_chunks = text_splitter.split_documents(data)
len(text_chunks), text_chunks[0]

(2755,
 Document(page_content='MANNINGFrançois CholletSECOND EDITION', metadata={'source': 'pdfs/Deep Learning with Python.pdf', 'page': 0}))

In [120]:
print(text_chunks[0].page_content)

MANNINGFrançois CholletSECOND EDITION


In [121]:
print(text_chunks[30].page_content)

for sequence generation 366IHow do you generate sequence data? 367The importance of the sampling strategy 368IImplementing text generation with Keras 369IA text-generation callback with variable-temperature sampling 372IWrapping up 37612.2 DeepDream 376Implementing DeepDream in Keras 377IWrapping up 38312.3 Neural style transfer 383The content loss 384IThe style loss 384INeural style transfer in Keras 385IWrapping up 39112.4 Generating images with variational autoencoders 391Sampling from


## Create the OpenAI embeddings object

In [122]:
# from openai import OpenAI
# client = OpenAI(
#     api_key=OPENAIAPIKEY
# )

# client

In [123]:
# from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
# embedding

In [124]:
len(embedding.embed_query(
    "how are you"
)) # len of embedding vector

1536

## Import Pinecone

### We'll create an embedding for each text chunk


In [125]:
"""

Pinecone is a hosted cloud service, so the index lives on Pinecone's servers rather than in this notebook.

The PineconeVectorStore class provided by LangChain can be used to interact
with Pinecone indexes. It’s important to remember that you must have an
existing Pinecone index before you can create a PineconeVectorStore object.

"""




from langchain_pinecone import PineconeVectorStore
# vectorstore = PineconeVectorStore(
#     # pinecone_api_key= PINECONE_API_KEY,
#     index_name="testing",
#     embedding=embedding)

# vectorstore
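
Since the index has to exist before `PineconeVectorStore` can use it, a small guard like the following can create it on first run. This is a sketch under assumptions: `ensure_index` is a hypothetical helper, it expects a `pinecone.Pinecone` client from pinecone-client v3 or later, and the cosine metric, cloud, and region in the commented wiring are placeholders to adapt. The dimension 1536 matches the embedding length printed earlier.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index if it is missing. Returns True if it was created.

    `pc` is expected to behave like a pinecone.Pinecone client
    (list_indexes().names() and create_index(...)).
    """
    if name in pc.list_indexes().names():
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Example wiring (not executed here; cloud and region are assumptions):
#
#   from pinecone import Pinecone, ServerlessSpec
#   pc = Pinecone(api_key=PINECONE_API_KEY)
#   ensure_index(pc, "testing", dimension=1536,
#                spec=ServerlessSpec(cloud="aws", region="us-east-1"))
```

The index dimension must equal the length of the vectors the embedding model returns, otherwise upserts are rejected.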

In [126]:
"""

The from_documents and from_texts methods of LangChain’s PineconeVectorStore class
 add records to a Pinecone index and return a PineconeVectorStore object.

The from_documents method accepts a list of LangChain’s Document class objects,
which can be created using LangChain’s CharacterTextSplitter class. The
from_texts method accepts a list of strings. Similarly to above, you must
provide the name of an existing Pinecone index and an Embeddings object.

"""



index_name = 'testing'

# vectorstore_from_docs = PineconeVectorStore.from_documents(
#         text_chunks,
#         index_name=index_name,
#         embedding=embedding,
#     )
# vectorstore_from_docs

In [127]:
# vectorstore_from_texts = PineconeVectorStore.from_texts(
#         [t.page_content for t in text_chunks],
#         index_name=index_name,
#         embedding=embedding,
#     )
# vectorstore_from_texts

In [128]:
docsearch = PineconeVectorStore.from_texts(
        [t.page_content for t in text_chunks],
        index_name=index_name,
        embedding=embedding,
    )
docsearch  # you can see all embeddings in the 'testing' index on pinecone.io too

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x7c579c368850>

### Doing a similarity search

In [129]:
query =" what's best ML model?"

In [130]:
docs = docsearch.similarity_search(query)
docs

[Document(page_content='the model to make predictions'),
 Document(page_content='a model that has statistical power, the question becomes, is yourmodel sufficiently powerful? Does it have enough layers and parameters to properlymodel the problem at hand? For instance, a logistic regression model has statisticalpower on MNIST but wouldn’t be sufficient to solve the problem well. Remember thatthe universal tension in machine learning is between optimization and generalization.The ideal model is one that stands right at the border between underfitting and over-fitting, between'),
 Document(page_content='to develop a small model that is capable of beating asimple baseline. At this stage, these are the three most important things you should focus on:\x83Feature engineering—Filter out uninformative features (feature selection) and useyour knowledge of the problem to develop new features that are likely to be useful.\x83Selecting the correct architecture priors—What type of model architecture

In [131]:
llm=OpenAI()

In [132]:
docsearch.as_retriever()

VectorStoreRetriever(tags=['PineconeVectorStore', 'OpenAIEmbeddings'], vectorstore=<langchain_pinecone.vectorstores.PineconeVectorStore object at 0x7c579c368850>)

In [133]:
qa = RetrievalQA.from_chain_type(  # builds the question-answering chain
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=docsearch.as_retriever()   # docsearch is the vector store holding all embeddings
)
# qa
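
The "stuff" chain type simply places every retrieved chunk into one prompt and sends it to the LLM. The sketch below shows that idea only; the prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual template. The example chunks are shortened from the similarity search output above.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into a single context block."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

retrieved = [
    "the model to make predictions",
    "The ideal model is one that stands right at the border "
    "between underfitting and overfitting",
]
prompt = build_stuff_prompt("what's best ML model?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks together fit inside the model's context window, which is one reason the chunk size is kept small.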

In [134]:
qa.run(query)

' The best ML model would be one that has enough layers and parameters to properly model the problem at hand, and stands at the border between underfitting and overfitting. Additionally, it should have well-selected features, the correct architecture priors, and optimal hyperparameters.'

#### Making a small QnA system (PDF based)

We'll ask questions about the PDF.

In [135]:
import sys

while True:
  user_input = input("Input Prompt: ")
  if user_input == 'exit':  # typing 'exit' stops the loop
    print('Exiting')
    sys.exit()
  if user_input == "":  # ignore empty input
    continue
  # retrieve the relevant chunks and let the LLM answer from them
  result = qa.invoke({
      'query': user_input
  })
  print(f"Answer:{result['result']}")

Input Prompt: authors of this book?
Answer: The authors of this book are open source developers at Hugging Face, including the creator of the Transformers library.
Input Prompt: name them
Answer: I am sorry, I do not have enough context to answer this question. The context is related to creating a second model that returns something, but I am not able to determine who or what should be named.
Input Prompt: name authors of this book
Answer: Lewis, Leandro, and Thomas
Input Prompt: exit
Exiting


SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
