# Tutorial: Weaviate RAG Walkthrough

This guide walks you through setting up Weaviate in a Docker container and building a simple RAG application on top of it, using the OpenAI API for both the LLM and the embeddings.

Please add your `OPENAI_API_KEY` as an environment variable:

`export OPENAI_API_KEY=your_actual_api_key_here`
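
A notebook kernel that was started before the `export` ran will not see the variable, so a quick check from Python can save a confusing authentication error later. The sketch below is one way to do that; `require_env` is a hypothetical helper name, not part of LangChain or the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return value


# Example usage (raises RuntimeError if the key is missing):
# require_env("OPENAI_API_KEY")
```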

In [None]:
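# The imports below also need langchain, langchain_community, pypdf and a Weaviate client.
# Assumption: the langchain_community Weaviate wrapper targets the v3 client API, hence the pin.
! pip install langchain langchain_community langchain_openai langchainhub pypdf "weaviate-client<4"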

First, import the libraries (I'll use LangChain as the orchestrator):

In [None]:
from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Weaviate

Next, identify the PDFs you want to load:

In [None]:
# Identify PDFs to Load

pdf1 = "documents/42337723.pdf"
pdf2 = "documents/main_notes.pdf"

Define a function that takes a list of file paths and returns the PDF pages as a list of LangChain `Document` objects.

In [None]:
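# PDF Loader

def load_pdf(files):
    """Load each PDF in `files` and return all of its pages in one flat list."""
    pages = []
    for file in files:
        loader = PyPDFLoader(file)
        # load_and_split() returns a list of Document objects for this file
        pages += loader.load_and_split()
    return pages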

Load the PDFs:

In [None]:
# Load in the PDFs as `pdf_pages` 
pdf_pages = load_pdf([pdf1, pdf2])

Let's check how many pages we have:

In [None]:
len(pdf_pages)

Print out a few pages to see the content we have.

In [None]:
# Print out the first 100 characters
print(pdf_pages[0].page_content[:100])

print(pdf_pages[205].page_content[:100])
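
The index `205` above is specific to the two PDFs used here; with other documents it may be out of range and raise an `IndexError`. A small guard, sketched below with a hypothetical helper `preview_page`, avoids that. It only assumes each page exposes a `page_content` string, as LangChain documents do.

```python
def preview_page(pages, index, n_chars=100):
    """Return the first n_chars of pages[index].page_content, or None if index is out of range."""
    if not -len(pages) <= index < len(pages):
        return None
    return pages[index].page_content[:n_chars]


# Example usage with the notebook's variable:
# print(preview_page(pdf_pages, 205))
```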

We can also print the metadata to see the document path (source) and page number:

In [None]:
# Print the metadata
print(pdf_pages[0].metadata)

print(pdf_pages[205].metadata)
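
Since every page carries its source path in the metadata, it is easy to check how many pages each PDF contributed. The sketch below assumes the metadata key is `source`, as described above; `pages_per_source` is a hypothetical helper name.

```python
from collections import Counter


def pages_per_source(pages):
    """Count how many loaded pages came from each source file."""
    return Counter(page.metadata.get("source", "<unknown>") for page in pages)


# Example usage with the notebook's variable:
# for source, count in pages_per_source(pdf_pages).items():
#     print(source, count)
```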

----------

### Load the pages into a Weaviate database

In [None]:
# Define the embeddings model
embeddings = OpenAIEmbeddings()

Now, we create the database `db` from the documents. This embeds every page and stores it in the Weaviate instance running at `http://localhost:8080`:

In [None]:
db = Weaviate.from_documents(pdf_pages, embeddings, weaviate_url="http://localhost:8080", by_text=False)
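
The `from_documents` call above assumes a Weaviate instance is already listening at `http://localhost:8080`. If the Docker container is still starting, the call fails with a connection error. The sketch below polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) using only the standard library; `is_weaviate_ready` is a hypothetical helper name, and the default URL simply mirrors the one used in this notebook.

```python
import urllib.request


def is_weaviate_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200, else False."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors (URLError subclasses OSError).
        return False


# Example usage before creating the database:
# if not is_weaviate_ready():
#     raise RuntimeError("Weaviate is not reachable; is the Docker container running?")
```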

### Query to find Similar Documents

In [None]:
query = "What does physics have to do with violins?"
docs = db.similarity_search(query)

print(docs[0].metadata)
print(docs[0].page_content[:100])
print(docs[1].metadata)
print(docs[1].page_content[:100])

In [None]:
query = "What are some reasons a model might overfit?"
docs = db.similarity_search(query)

print(docs[0].metadata)
print(docs[0].page_content[:100])
print(docs[1].metadata)
print(docs[1].page_content[:100])
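
`similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The toy example below illustrates that ranking step in plain Python, assuming cosine similarity as the metric. The three-dimensional vectors and chunk labels are made up for illustration (real embeddings have far more dimensions), and this is a sketch of the idea, not how Weaviate implements search internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the labels of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Made-up vectors standing in for embedded chunks
stored_vectors = {
    "violin acoustics": [0.9, 0.1, 0.0],
    "overfitting": [0.0, 0.2, 0.9],
    "string vibration": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

print(top_k(query_vector, stored_vectors))
# prints ['violin acoustics', 'string vibration']
```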

### RAG to Answer Questions

First, we set the Weaviate database as our retriever, then define the prompt and the LLM:

In [None]:
# Set the Weaviate database as the retriever
retriever = db.as_retriever()
# Define our prompt as the RAG prompt from LangChain's prompt hub
prompt = hub.pull("rlm/rag-prompt")
# Set the LLM to OpenAI's gpt-3.5-turbo via ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

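Set up the RAG chain. The retriever fills the prompt's `context` slot, and `RunnablePassthrough()` forwards the user's question unchanged into the `question` slot: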

In [None]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
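
In the chain above, the retriever's output (a list of documents) is handed to the prompt as-is, so the context is typically rendered from the list's string representation, metadata included. A common alternative is to join only the page text before it reaches the prompt. The sketch below shows such a helper; `format_docs` is a name chosen here, and wiring it in as `retriever | format_docs` is an optional variation, not what this notebook runs.

```python
def format_docs(docs):
    """Join the text of retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Optional variation of the chain (assumes `retriever`, `prompt` and `llm` from the cells above):
# rag_chain = (
#     {"context": retriever | format_docs, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )
```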

In [None]:
rag_chain.invoke("What does physics have to do with violins?")

In [None]:
rag_chain.invoke("What are some reasons a model might overfit?")

