# openai-rag-example
### September 27, 2023

This notebook uses LangChain and the Azure OpenAI API to build a question-answering (QA) application. It reads PDFs, splits them into chunks, and stores the chunks in a searchable vector store, here an Azure Cognitive Search instance. The vector store then acts as a retriever: it passes the most relevant chunks to the LLM, which in this case is an Azure OpenAI deployment.

In [None]:
%pip install azure-search-documents==11.4.0b8 azure-identity pypdf langchain==0.0.302 pdfplumber

## Initialize
Load the libraries and set up your API information, including keys.

In [None]:
import openai
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.azuresearch import AzureSearch
from langchain.retrievers import AzureCognitiveSearchRetriever

In [None]:
# Set up access to your Azure OpenAI resource using secrets stored in Azure Key Vault
# (dbutils is available only inside Databricks notebooks)
SECRET_SCOPE = "<insert your Azure keyvault name here>"
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_BASE"] = "https://openai-gci-eda-ds-dev-01.openai.azure.com/"
OPENAI_KEY_VALUE = dbutils.secrets.get(scope = SECRET_SCOPE, key = "<insert the name of your openai key here>")
OPENAI_API_ENDPOINT = "<insert your Azure OpenAI endpoint here>"
OPENAI_LLM = "<insert the name of your azure openai deployment here>"
OPENAI_EMBEDDER: str = "<insert the name of your OpenAI embedder here>"

# Define your Cognitive Search endpoint for the vector store
COGSRCH_ENDPOINT = "<insert the name of your Cognitive Search endpoint here>"
COGSRCH_INDEX: str = "<insert the name of the index in Cognitive Search you will generate here>"
COGSRCH_KEY_VALUE: str = "<insert the name of the Cognitive Search key here>"
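
Because every value above ships as an `<insert ...>` placeholder, it can help to fail fast before any API call is made. The following is a minimal sketch and not part of the original notebook; the helper name `find_unfilled` and the placeholder convention (a string that starts with `<` and ends with `>`) are assumptions for illustration.

```python
def find_unfilled(settings: dict) -> list:
    """Return the names of settings that still hold a '<...>' placeholder."""
    return [
        name
        for name, value in settings.items()
        if isinstance(value, str)
        and value.strip().startswith("<")
        and value.strip().endswith(">")
    ]


# Example values standing in for the notebook's configuration variables
example_settings = {
    "OPENAI_LLM": "<insert the name of your azure openai deployment here>",
    "COGSRCH_INDEX": "hr-policies",  # hypothetical index name
}
missing = find_unfilled(example_settings)
print(missing)
```

In the notebook itself, the same check could be run over a dictionary built from the real variables, raising an error if the returned list is not empty.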


## Load the PDFs

In [None]:
from langchain.document_loaders import PyPDFDirectoryLoader

raw_folder = "<directory to PDFs here>"
path_folder = "<directory for vector store here>"

# Use langchain to load every PDF found in raw_folder
loader = PyPDFDirectoryLoader(raw_folder)


## Split the PDFs into smaller chunks

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# Define text splitter for langchain
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)

pages = loader.load_and_split(text_splitter=text_splitter)
print(f"Loaded {len(pages)} chunks")
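
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is an illustration only and not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator and then merges the pieces up to the size limit, whereas the hypothetical `sliding_chunks` helper below cuts at fixed character offsets.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the overlap plays: text that straddles a boundary still appears whole in at least one chunk.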

## Encode the chunks and store them in an indexed vector store

In [None]:
# Create the Azure OpenAI embedder and attach it to a Cognitive Search vector store
embeddings: OpenAIEmbeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY_VALUE, deployment=OPENAI_EMBEDDER)
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=COGSRCH_ENDPOINT,
    azure_search_key=COGSRCH_KEY_VALUE,
    index_name=COGSRCH_INDEX,
    embedding_function=embeddings.embed_query,
)

In [None]:
# Using langchain, embed the PDF chunks and add them to the vector store
vector_store.add_documents(documents=pages)

In [None]:
# Perform a sample similarity search
docs = vector_store.similarity_search(
    query="How many vacation days do I get?",
    k=1,
    search_type="similarity",
)
print(docs[0].page_content)
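
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea with cosine similarity over toy three-dimensional vectors. It rests on stated assumptions: the vectors and texts are made up, the `cosine` and `top_k` helpers are hypothetical, and Azure Cognitive Search relies on its own index rather than this brute-force scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=1):
    """Return the texts of the k stored entries most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up (text, embedding) pairs standing in for indexed chunks
toy_store = [
    ("Employees accrue vacation days monthly.", [0.9, 0.1, 0.0]),
    ("Expense reports are due each Friday.", [0.1, 0.9, 0.0]),
    ("The office closes on public holidays.", [0.5, 0.2, 0.7]),
]
toy_query = [1.0, 0.0, 0.1]  # made-up embedding of the question

print(top_k(toy_query, toy_store, k=1))
```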

## Put it all together into a QA chain
LangChain combines the vector store retriever and the LLM into one chain that answers a query with a response grounded in the retrieved chunks.
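
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places the retrieved chunks into a single prompt ahead of the question. The sketch below imitates that step in plain Python. The template wording and the `build_prompt` helper are assumptions for illustration and do not reproduce LangChain's actual prompt.

```python
def build_prompt(question: str, chunks: list) -> str:
    """Concatenate retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk texts standing in for what the retriever would return
prompt = build_prompt(
    "How many vacation days do I get?",
    ["Chunk one text.", "Chunk two text."],
)
print(prompt)
```

Because all chunks share one prompt, the number of retrieved chunks and the chunk size together determine how much of the model's context window is consumed.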

In [None]:
# Use langchain to combine the retriever and a call to OpenAI as an LLM to build the application

from langchain.chains import RetrievalQA
from langchain.chat_models import AzureChatOpenAI
import langchain
langchain.debug = True   # Set this to False once you see how it works

llm = AzureChatOpenAI(deployment_name=OPENAI_LLM, temperature=0, openai_api_key=OPENAI_KEY_VALUE)
# Retrieve the top 2 chunks per query; k is passed through search_kwargs
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vector_store.as_retriever(search_kwargs={"k": 2}))
qa_chain({"query": "How many vacation days do I get?"})
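
The chain returns a dictionary rather than a bare string. The sketch below formats such a dictionary for display. It assumes the `result` key that `RetrievalQA` uses for the answer and an optional `source_documents` list, which is present only if the chain was built with `return_source_documents=True`. The `format_answer` helper and the `FakeDoc` stand-in are hypothetical, the sample answer is placeholder text, and the `source` and `page` metadata keys are assumed to be what the PDF loader recorded.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document, used only for this example."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_answer(result: dict) -> str:
    """Render the answer followed by one line per source document, if any."""
    lines = [result.get("result", "").strip()]
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        lines.append(f"- {source} (page {page})")
    return "\n".join(lines)


# Hypothetical chain output; no real model call is made here
fake_result = {
    "query": "How many vacation days do I get?",
    "result": " Placeholder answer text. ",
    "source_documents": [FakeDoc("...", {"source": "handbook.pdf", "page": 3})],
}
print(format_answer(fake_result))
```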