# Integrate local data into an OpenAI knowledge base via a vector database
## Part II: PDF files

Vector databases are a class of databases engineered to store and query unstructured, high-dimensional data efficiently. They do this by storing and indexing vector embeddings, which allows fast retrieval. Each data point is represented as a numerical vector (an embedding), which makes it well suited to mathematical operations and to analysis with machine learning algorithms.

These databases enable vector-based search, also known as semantic search, which matches on the meaning of a query rather than on exact keywords. Because datasets are encoded into meaningful vector representations, the distance between two vectors reflects how similar the underlying items are. Approximate nearest neighbor (ANN) search algorithms then retrieve the vectors closest to the query quickly, trading a small amount of accuracy for a large gain in speed.

![vector database](https://miro.medium.com/v2/resize:fit:640/format:webp/0*d8Utelp6ffNhi_eY.png)

Source: https://odsc.medium.com/a-gentle-introduction-to-vector-search-3c0511bc6771
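
To make the idea of "distance reflects similarity" concrete, here is a minimal, library-free sketch of a brute-force similarity search. The three-dimensional vectors and their labels are invented for illustration (real embeddings have hundreds or thousands of dimensions), cosine similarity is only one common choice of metric, and the exhaustive scan shown here is exact rather than approximate; an ANN index avoids comparing the query against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Toy "embeddings", made up for this example only
toy_index = {
    "k-means clustering": [0.9, 0.1, 0.0],
    "hierarchical clustering": [0.8, 0.3, 0.1],
    "baking sourdough bread": [0.0, 0.2, 0.9],
}

query_vector = [1.0, 0.2, 0.0]  # pretend this encodes a clustering question
results = top_k(query_vector, toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two clustering entries rank above the unrelated one because their vectors point in nearly the same direction as the query vector.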

### Import libraries and environment variables

In [34]:
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

import pickle
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv('OPENAI_API_KEY')

API Reference:
- [openai](https://platform.openai.com/docs/api-reference?lang=python)
- [langchain document_loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
- [langchain agents](https://python.langchain.com/docs/modules/agents/)
- [FAISS vector database](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.faiss.FAISS.html)

### Load documents, make embeddings and put them into a vector database
- load documents into memory with <span style="color:#40E0D0">langchain.document_loaders</span>
- chunk documents into text pieces of a given length and overlap, so that each piece keeps enough surrounding context to be meaningful, with <span style="color:#40E0D0">langchain.text_splitter.RecursiveCharacterTextSplitter</span>
- create embedding vectors for the text chunks with <span style="color:#40E0D0">langchain.embeddings.OpenAIEmbeddings</span>
- load the vectors into a vector database (FAISS) with <span style="color:#40E0D0">langchain.vectorstores</span>
- pickle the db for re-use

In [None]:
document_loader = DirectoryLoader('./data', glob="**/*.pdf", loader_cls=PyPDFLoader, show_progress=True)
docs = document_loader.load()

In [None]:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)

splitted_docs = text_splitter.split_documents(docs)
print(len(splitted_docs), splitted_docs[0])  # number of chunks and the first chunk
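
What `chunk_size` and `chunk_overlap` mean can be illustrated with a plain sliding window over a string. This is only a simplified sketch: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and line breaks rather than at fixed character offsets.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the preceding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when the overlap is large enough.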

In [53]:
embeddings = OpenAIEmbeddings(openai_api_key=API_KEY)

In [54]:
vectorstore = FAISS.from_documents(splitted_docs, embeddings)

with open('vectorstore_pdf.pkl', 'wb') as file:
    pickle.dump(vectorstore, file)

### OpenAI Query including vector database and query-history
- load the pickled vectorstore database
- define the Large Language Model (LLM) parameters
- define the question-answer prompt template
- set up the memory for the chat history
- make queries

In [55]:
with open('vectorstore_pdf.pkl', 'rb') as pickled_file:
    vectorstore = pickle.load(pickled_file)

In [56]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    openai_api_key=API_KEY
)

In [57]:

prompt_template = """The following is a conversation with an AI research assistant.
The assistant's tone is technical and scientific.
I will provide you with pieces of [Context] to answer the [Question].
If you don't know the answer based on the [Context], just say that you don't know; don't try to make up an answer.
[Context]: {context}
[Question]: {question}
AI Answer:"""
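
How the `{context}` and `{question}` placeholders end up filled can be sketched without LangChain. The shortened template, the `build_prompt` helper, and the example chunks below are all hypothetical; the assumption is that the retrieved chunks are joined into one block of text and substituted for `{context}`, which is roughly what the chain does before calling the model.

```python
# Shortened stand-in for the template above, same placeholders
template = (
    "Answer using only the [Context].\n"
    "[Context]: {context}\n"
    "[Question]: {question}\n"
    "AI Answer:"
)


def build_prompt(template, retrieved_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(retrieved_chunks)  # separator chosen for illustration
    return template.format(context=context, question=question)


# Invented chunks standing in for what the retriever would return
example_chunks = [
    "Chunk 1: text retrieved from the first PDF page.",
    "Chunk 2: text retrieved from another PDF page.",
]
prompt = build_prompt(template, example_chunks, "What does the paper propose?")
print(prompt)
```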


In [58]:
chat_history = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key='answer')

In [59]:
PROMPT = PromptTemplate(input_variables=["question", "context"], template=prompt_template)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    memory=chat_history,
    retriever=vectorstore.as_retriever(search_type="similarity"),
    # max_tokens_limit=4000,
    combine_docs_chain_kwargs={'prompt': PROMPT}
)
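
The interplay of memory, retriever, and prompt can be sketched as a rough control flow. Everything below is a simplified stand-in with hypothetical names: in the real chain an LLM call rewrites a follow-up question into a standalone one, whereas the stub here just prepends the history, and `fake_retrieve` and `fake_generate` replace the vector store lookup and the model call.

```python
def format_history(turns):
    """Render (question, answer) pairs as plain text."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in turns)


def condense_question(history_text, question):
    # Stub: the real chain asks the LLM to rewrite the follow-up as a
    # standalone question; here the history is simply prepended.
    if not history_text:
        return question
    return f"{history_text}\nFollow-up: {question}"


def ask(question, turns, retrieve, generate):
    """One conversational step: condense, retrieve, answer, remember."""
    standalone = condense_question(format_history(turns), question)
    chunks = retrieve(standalone)         # similarity search in the vector store
    reply = generate(standalone, chunks)  # LLM call with the filled prompt
    turns.append((question, reply))       # memory grows by one turn
    return reply


# Hypothetical stand-ins for the retriever and the LLM
def fake_retrieve(query):
    return ["chunk about the ensemble method"]


def fake_generate(query, chunks):
    return f"answer based on {len(chunks)} chunk(s)"


turns = []
first = ask("What is the ensemble method?", turns, fake_retrieve, fake_generate)
second = ask("What are its steps?", turns, fake_retrieve, fake_generate)
print(format_history(turns))
```

Because the stored turns feed into the next question, a follow-up such as "What are its steps?" can be resolved against the earlier exchange before retrieval happens.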

In [None]:
result = qa({'question': 'What are the key findings concerning the ensemble method for estimating the number of clusters?'})
print(result.get("answer", ""))

In [None]:
qa({'question': 'Please describe in short paragraphs what is done in each step of the new ensemble model.'})["answer"]

In [None]:
qa({'question': 'Please describe in short paragraphs what is done in each step of the new ensemble model to estimate the optimal number of clusters.'})["answer"]

## References

- https://github.com/Coding-Crashkurse/LangChain-Basics/blob/main/basics.ipynb
- https://python.langchain.com/docs/integrations/toolkits/document_comparison_toolkit
- https://artificialcorner.com/use-langchain-and-gpt-3-5-to-chat-with-your-favorite-podcast-guests-b97d1ddd42e1
- https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
- https://www.linkedin.com/pulse/how-use-streamlit-app-build-chatbot-can-respond-questions-shah
- https://amaarora.github.io/posts/2023-07-27_Document_Question_Answering_with_LangChain.html
- https://medium.com/mlearning-ai/build-a-chat-with-csv-app-using-langchain-and-streamlit-94a8b3363aa9