# Integrate local data into an OpenAI knowledge base via a vector database
## Part I: txt-files

Vector databases are an emerging class of databases engineered to store and query vector embeddings efficiently, which makes them well suited to unstructured data such as text. Each data point is represented as a numerical vector (an embedding), so it can be processed with mathematical operations and machine learning algorithms. The database indexes these vectors to allow fast retrieval.

These databases enable vector-based search, also known as semantic search, which matches on the meaning of a query rather than on exact keywords. Because the dataset is encoded into meaningful vector representations, the distance between two vectors reflects how similar the underlying items are. Approximate Nearest Neighbor (ANN) algorithms then retrieve the vectors closest to the query quickly, trading a small amount of accuracy for speed.

![vector database](https://miro.medium.com/v2/resize:fit:640/format:webp/0*d8Utelp6ffNhi_eY.png)

Source: https://odsc.medium.com/a-gentle-introduction-to-vector-search-3c0511bc6771
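
To make the idea of "distance reflects similarity" concrete, here is a minimal sketch of an exact nearest-neighbour search using cosine similarity. The three-dimensional vectors and the texts are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `cosine_similarity`, `toy_index`, and `search` are hypothetical names, not part of any library. An ANN index approximates this ranking without scoring every entry.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": text -> vector
toy_index = {
    "rain and umbrellas": [0.9, 0.1, 0.0],
    "sunny weather": [0.7, 0.3, 0.1],
    "python code": [0.0, 0.2, 0.9],
}

def search(query_vector, index, k=2):
    """Exact search: score every entry against the query and keep the k best."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]

query = [0.8, 0.2, 0.0]  # pretend this is the embedding of a weather question
results = search(query, toy_index)
print(results)
```

The two weather-related entries rank above the unrelated one because their vectors point in nearly the same direction as the query vector.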

### Import libraries and environment variables

In [1]:
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import OpenAI
#from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

import pickle
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv('OPENAI_API_KEY')

API Reference:
- [openai](https://platform.openai.com/docs/api-reference?lang=python)
- [langchain document_loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
- [langchain agents](https://python.langchain.com/docs/modules/agents/)
- [FAISS vector database](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.faiss.FAISS.html)

### Load documents, make embeddings and put them into a vector database
- load documents into memory with <span style="color:#40E0D0">langchain.document_loaders</span>
- chunk documents into text pieces of a given length and overlap, so that each piece keeps enough surrounding context to be meaningful, with <span style="color:#40E0D0">langchain.text_splitter.RecursiveCharacterTextSplitter</span>
- create an embedding vector for each text chunk with <span style="color:#40E0D0">langchain.embeddings.OpenAIEmbeddings</span>
- load the vectors into a vector database (FAISS) with <span style="color:#40E0D0">langchain.vectorstores</span>
- pickle the db for re-use
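
The chunking step can be illustrated with a simplified sketch. The function below is a hypothetical stand-in, not the `RecursiveCharacterTextSplitter` algorithm: it cuts at fixed character positions, whereas the LangChain splitter prefers natural boundaries such as paragraph breaks, line breaks, and spaces. It only shows how chunk size and overlap interact.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of its predecessor, so a sentence that straddles a boundary still appears in full in at least one chunk when the overlap is large enough.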

In [4]:
#document_loader = DirectoryLoader('./data', glob="**/*.txt", loader_cls=TextLoader, show_progress=True)
document_loader = TextLoader("./data/uncertainty.txt")
docs = document_loader.load()


In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)

splitted_docs = text_splitter.split_documents(docs)
splitted_docs[0] # show first document

Document(page_content='Uncertainty', metadata={'source': './data/uncertainty.txt'})

In [6]:
embeddings = OpenAIEmbeddings(openai_api_key=API_KEY)

In [7]:
vectordb_for_txt = FAISS.from_documents(splitted_docs, embeddings)

with open('vectordb_for_txt.pkl', 'wb') as file:
    pickle.dump(vectordb_for_txt, file)

### OpenAI query using the vector database and chat history
- load the pickled vector database

In [8]:
# Only unpickle files you created yourself: pickle.load can execute arbitrary code.
with open('vectordb_for_txt.pkl', 'rb') as pickled_file:
    vectorstore = pickle.load(pickled_file)

In [14]:
llm = OpenAI(
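    # Note: OpenAI has since retired text-davinci-003; "gpt-3.5-turbo-instruct" is its suggested replacement.
    model_name="text-davinci-003",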
    temperature=0,
    openai_api_key=API_KEY
)

In [15]:
#prompt_template = """The following is a conversation with an AI research assistant. 
#The assistant tone is technical and scientific.The assistant answers should be easy to understand.
#"""
prompt_template = """The following is a conversation with an AI research assistant. 
The assistant tone is technical and scientific.

{context}

User Question: {question}
Answer AI Assistant: """

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=['context', 'question']
)
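
To see what the model eventually receives, the sketch below fills the same template by hand with plain `str.format`. The retrieved chunks are invented for illustration, `build_prompt` is a hypothetical helper, and joining the chunks with blank lines is an assumption about how the chain combines documents.

```python
prompt_template = """The following is a conversation with an AI research assistant.
The assistant tone is technical and scientific.

{context}

User Question: {question}
Answer AI Assistant: """

# Hypothetical chunks, standing in for what the retriever would return
retrieved_chunks = [
    "A sensor model relates the hidden state to the observations.",
    "A transition model describes how the hidden state changes over time.",
]

def build_prompt(template, chunks, question):
    """Insert the retrieved text and the user question into the template."""
    context = "\n\n".join(chunks)  # assumption: chunks are separated by blank lines
    return template.format(context=context, question=question)

final_prompt = build_prompt(prompt_template, retrieved_chunks, "What is a sensor model?")
print(final_prompt)
```

The placeholder names in the template must match the `input_variables` of the `PromptTemplate`, otherwise the formatting step fails.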

In [16]:
chat_history = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key='answer')

In [17]:
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    memory=chat_history,
    retriever=vectorstore.as_retriever(),
    combine_docs_chain_kwargs={'prompt': PROMPT}
)
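
Conceptually, a conversational retrieval chain does three things per question: it rewrites a follow-up into a standalone question using the chat history, retrieves matching chunks, and asks the model to answer from those chunks. The sketch below mimics that flow with toy stand-ins. All function names are hypothetical, the rewrite and the answer are done by an LLM in the real chain, and the word-overlap retriever replaces the embedding search.

```python
import re

def tokens(text):
    """Lower-case word set, ignoring punctuation and digits."""
    return set(re.findall(r"[a-z]+", text.lower()))

def toy_condense(question, history):
    """Stand-in for the LLM call that turns a follow-up into a standalone question."""
    if not history:
        return question
    previous_question, _ = history[-1]
    return f"{question} (follow-up to: {previous_question})"

def toy_retrieve(question, documents, k=2):
    """Stand-in retriever: rank documents by the number of words shared with the question."""
    words = tokens(question)
    ranked = sorted(documents, key=lambda d: len(words & tokens(d)), reverse=True)
    return ranked[:k]

def toy_answer(prompt):
    """Stand-in for the LLM: reports only how much text it was given."""
    return f"[answer generated from {len(prompt)} prompt characters]"

def ask(question, history, documents):
    standalone = toy_condense(question, history)           # 1. resolve references such as "he"
    context = "\n\n".join(toy_retrieve(standalone, documents))  # 2. fetch relevant chunks
    prompt = f"{context}\n\nUser Question: {standalone}\nAnswer AI Assistant: "
    answer = toy_answer(prompt)                            # 3. answer from the chunks
    history.append((question, answer))                     # memory: remember this exchange
    return standalone, answer

documents = [
    "Martin Miller is 60 years old.",
    "Martin Miller loves skiing.",
    "The weather is a hidden state.",
]
history = []
first, _ = ask("How old is martin miller?", history, documents)
second, _ = ask("What does he love?", history, documents)
print(second)
```

Without the history, the follow-up "What does he love?" contains nothing the retriever could match against the documents, which is why the chain keeps a memory object.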

In [20]:
#qa({'question': 'How old is martin miller?'})
qa({'question': 'Please give a short description (5 sentences) about the Markov model discribed in the retriever document.'})

{'question': 'Please give a short description (5 sentences) about the Markov model discribed in the retriever document.',
 'chat_history': [HumanMessage(content='Please make a summary of a text about making optimal decisions given limited information and uncertainty given to you via retriever.', additional_kwargs={}, example=False),
  AIMessage(content='\nThis lecture discusses how AI can make optimal decisions given limited information and uncertainty. An example is given of an AI trying to infer the weather, but only having access to an indoor camera that records how many people brought umbrellas with them. The lecture explains how AI can use a sensor model (also called an emission model) to represent the probabilities of the hidden state (the weather). The lecture also explains how AI can use a transition model to represent the probabilities of the hidden state changing over time. Finally, the lecture explains how AI can use a Bayesian filter to combine the sensor and transition mod

In [24]:
qa({'question': 'What does he love?'})

{'question': 'What does he love?',
 'chat_history': [HumanMessage(content='How old is martin miller?', additional_kwargs={}, example=False),
  AIMessage(content=' Martin Miller is 60 years old.', additional_kwargs={}, example=False),
  HumanMessage(content='What does he love?', additional_kwargs={}, example=False),
  AIMessage(content=' Martin Miller loves skiing, knitting, and writing good, concise, and clean Python code.', additional_kwargs={}, example=False)],
 'answer': ' Martin Miller loves skiing, knitting, and writing good, concise, and clean Python code.'}

## References

- https://github.com/Coding-Crashkurse/LangChain-Basics/blob/main/basics.ipynb
- https://python.langchain.com/docs/integrations/toolkits/document_comparison_toolkit