# The Challenge
I'm tackling an interesting problem: given a user query, search through a PDF document and provide feedback on how well the query aligns with the document's content.

# The Solution
The approach is a three-step process: Load & Index, Search & RAG, and Feedback Generation.

## Load & Index
First, I need to understand the PDF document. I do this by creating a "semantic index". It's like creating a map of the document, but instead of landmarks, we have vectors.
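
To make the "map of vectors" idea concrete, here is a minimal sketch in plain Python. It is not what this notebook runs: simple word counts stand in for real embeddings, and the names `embed`, `cosine_similarity` and the sample chunks are made up for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (an assumption for illustration)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical chunks standing in for pages of a PDF.
chunks = [
    "testing for weak password policy",
    "testing for sql injection",
    "testing session timeout after login",
]
# The "index" is just each chunk stored next to its vector.
index = [(chunk, embed(chunk)) for chunk in chunks]

query_vector = embed("password policy")
best_chunk, _ = max(index, key=lambda item: cosine_similarity(query_vector, item[1]))
print(best_chunk)
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count toy cannot do.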

## Search & RAG
Next, I take your query and find the most closely related parts of the PDF document. This is where RAG (Retrieval-Augmented Generation) comes in. It's like giving the system a cheat sheet before the big test.
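
The cheat-sheet idea can be sketched without any LLM at all. The example below is a simplified illustration, not this notebook's implementation: `retrieve` ranks passages by shared words instead of vector similarity, and `build_prompt` and the sample passages are hypothetical.

```python
def retrieve(query, passages, k=2):
    """Rank passages by how many query words they share (a toy stand-in for vector search)."""
    query_words = set(query.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, passages):
    """Put the retrieved passages in front of the question: the 'cheat sheet'."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical passages standing in for chunks of the PDF.
passages = [
    "Lock the account after repeated failed login attempts.",
    "Validate file uploads on the server.",
    "Use generic error messages on failed login.",
]
query = "How should failed login attempts be handled?"
prompt_text = build_prompt(query, retrieve(query, passages))
print(prompt_text)
```

Only the passages that made it through retrieval reach the prompt, so the model answers from the document rather than from memory alone.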

## Feedback Generation
Finally, I generate feedback for you. This isn't just a simple "yes" or "no". I provide detailed feedback with references from the PDF document. It's like having footnotes for your query.

# Caveats and Considerations
- **AI and LLMs are still evolving**. They’re like teenagers - unpredictable and constantly changing. So, sometimes, they might give you surprising outputs.
- **Quality of feedback** depends on your query. Too short or too vague, and you might not get accurate feedback. Too long or complex, and you might confuse the system. For optimal results, aim for the Goldilocks zone - a few sentences to about 1 paragraph. Not too short, not too long.
- **Feedback may be slow**. It might take a few seconds before the LLM responds.
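
If you want to nudge users toward the Goldilocks zone, a small length check before calling the LLM is one option. This is only a sketch: `check_query_length` is a hypothetical helper, and the word limits are illustrative guesses rather than measured thresholds.

```python
def check_query_length(query, min_words=10, max_words=150):
    """Return a hint about query length; the word limits are illustrative guesses."""
    word_count = len(query.split())
    if word_count < min_words:
        return "too short"
    if word_count > max_words:
        return "too long"
    return "ok"

print(check_query_length("Testing login"))
```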

# Dive Deeper
The code is open for you to explore. Feel free to fork it, and see how it fits your use case. I'll break down each section and highlight key points.

If you have any suggestions or improvements, don't hesitate to share. For more technical details, continue reading along this Jupyter notebook.  

I started developing this in a Jupyter notebook, then implemented it as a FastAPI backend with a SvelteKit frontend. I then reorganized and rewrote the notebook to distill and explain the essential parts. I also changed the implementation from AzureOpenAI to OpenAI and from AzureSearch to Chroma. LangChain's wrappers around them make it easy to swap the implementations.

Import dependencies and initialize the environment. Have a .env file with the following variable:

`
OPENAI_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
`

In [None]:
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain_core.output_parsers.string import StrOutputParser
from langchain_community.vectorstores import Chroma
from operator import itemgetter

load_dotenv()
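
A missing key tends to surface later as a confusing authentication error, so it can help to fail early. The helper below is a sketch using only the standard library; `require_env` is a hypothetical name, not part of python-dotenv or LangChain.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would stop the notebook immediately if the .env file was not picked up.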


My example below uses the OWASP Web Security Testing Guide 4.2 PDF and tells the LLM that it is a web cyber security expert. You can play around with different data; the variables below are the ones most likely to need changing when the data changes. The llm_personality and llm_reference_instructions values are incorporated into the system prompt. They are important pieces of prompt engineering because they give the LLM context, and they let the LLM create the reference links. In this example it only displays page numbers, but it could be more elaborate, such as creating HTML links to the reference document.
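
The reference instruction in the next cell asks the LLM to read the page number from text such as 'Web Security T esting Guide v4.2445'. The same rule can be written as a regular expression, which might be handy for sanity-checking the pages the LLM cites. This is a sketch: `extract_page` is a hypothetical helper, and it assumes the page number always follows "v4.2" directly.

```python
import re

# Assumes the page number is the run of digits immediately after "v4.2",
# as in the example given to the LLM.
PAGE_PATTERN = re.compile(r"v4\.2(\d+)")

def extract_page(text):
    """Return the page number that follows 'v4.2', or None if there is no match."""
    match = PAGE_PATTERN.search(text)
    return int(match.group(1)) if match else None

print(extract_page("Web Security T esting Guide v4.2445"))  # 445
```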

In [None]:
filename = 'wstg-v4.2.pdf'

prompt_placeholders = {
    'topic': 'OWASP Web Security Testing Guide',
    'llm_personality': 'You are a web cyber security expert. Upon searching the OWASP Web Security Testing Guide, provide details of the test such as: Summary, Objectives, How to Test and Remediation',
    # 'T esting' below is left as-is: it appears to mirror the text extracted from the PDF.
    'llm_reference_instructions': "An example is 'Web Security T esting Guide v4.2445', the page is 445 as it is the digits after v4.2.  Put in the reference Page: 445"
}

user_query = """
Testing Login

1. Go to the login page, login as the test user.
2. Try to login with incorrect password
3. After successful login with correct password check profile page is correct.

"""


Key components to be initialized

In [None]:

loader = PyPDFLoader(f'./pdfs/{filename}')

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

# The file name doubles as the Chroma collection name.
vector_store = Chroma(collection_name=filename, embedding_function=embeddings)

model = ChatOpenAI(model='gpt-4-turbo-preview')

Load & Index the pdf into the vector database.

In [None]:

documents = loader.load_and_split()

vector_store.add_documents(documents=documents)
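
`load_and_split()` returns the PDF as smaller chunks, and each chunk is embedded separately. To show why chunks usually overlap, here is a deliberately simplified character-based splitter. It is an illustration only, not the splitter LangChain uses, and `split_text` with its sizes is hypothetical.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 10  # 100 characters
pieces = split_text(sample)
print(len(pieces), [len(piece) for piece in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.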

Search the index to test it out.

In [None]:
def search(query):

    docs = vector_store.similarity_search(
        query=query,
        # Number of documents to return. Experiment with this value: too few may omit
        # relevant information, too many may make the context too large.
        k=5
    )
    return docs

search(user_query)
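
The raw search output can be long, so a one-line summary per result makes it easier to judge whether the right pages came back. The sketch below assumes each result exposes `page_content` and `metadata` (as LangChain documents do) and that the loader stores a `page` key in `metadata`. `summarize` is a hypothetical helper, and the stand-in objects with their page numbers are made up so the example runs without a vector store.

```python
from types import SimpleNamespace

def summarize(docs, width=60):
    """Return one line per result: page metadata plus the start of the text."""
    lines = []
    for doc in docs:
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces before truncating.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"page {page}: {snippet}")
    return lines

# Stand-in objects (made-up content) so the sketch runs on its own.
fake_docs = [
    SimpleNamespace(page_content="Testing for Weak\nLockout Mechanism", metadata={"page": 10}),
    SimpleNamespace(page_content="Testing for Credentials Transported", metadata={}),
]
for line in summarize(fake_docs):
    print(line)
```

In the notebook you could call `summarize(search(user_query))` instead of displaying the documents directly.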

Use a prompt template that has placeholders between {}.

In [None]:
template_system = """
{llm_personality}

The following search results are in order of descending relevance:

{context}

Based on the information retrieved, please provide a response to the user's query.

Since there are several possible matches, use footnote style to clearly mark which reference each match was taken from.
Specify only the page number in the reference. The page number is extracted from the starting words of each search result.
{llm_reference_instructions}
"""


template_user = """
Does the user query follow {topic}?

user query below:

{user_query}
"""

prompt = ChatPromptTemplate.from_messages([
        SystemMessagePromptTemplate.from_template(template_system),
        HumanMessagePromptTemplate.from_template(template_user)
    ])
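
LangChain's default placeholder syntax is the same `{name}` style that Python's `str.format` uses, so the substitution step can be previewed with plain strings. This is a sketch of the idea only; the shortened template and values are illustrative, and LangChain adds message roles and validation on top.

```python
# A shortened, illustrative version of the user template.
template = "Does user query follow {topic}\n\nuser query below:\n\n{user_query}"

filled = template.format(topic="OWASP Web Security Testing Guide", user_query="Testing Login")
print(filled)

# Leaving out a placeholder raises KeyError, which names the missing variable.
try:
    template.format(topic="OWASP Web Security Testing Guide")
    missing = None
except KeyError as err:
    missing = err.args[0]
print("missing placeholder:", missing)
```

The same applies to the chain: every placeholder in the templates needs a matching key in the chain's input or in the dictionary built at the start of the chain.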

Below is how to build the chain. At the start of the chain you create a dictionary whose values are runnables. To pass in values from the input you use itemgetter. One important subchain is the retriever, which effectively puts the search results into the {context} placeholder.

In [None]:
retriever = vector_store.as_retriever(search_type='similarity')
context_subchain = itemgetter('user_query') | retriever

chain = (
    {
        'context': context_subchain,
        'user_query': itemgetter('user_query'),  # pick only the query text, not the whole input dict
        'topic': itemgetter('topic'),
        'llm_personality': itemgetter('llm_personality'),
        'llm_reference_instructions': itemgetter('llm_reference_instructions')
    } 
    | prompt
    | model
    | StrOutputParser()
)

completions = chain.invoke({'user_query': user_query, **prompt_placeholders})  # merge the query into the placeholders
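
The dictionary at the start of the chain can be puzzling at first. Conceptually, every value in it is called with the same input, and the results are collected under the same keys. The plain-Python sketch below mimics that behaviour. It is an analogy under that assumption, not LangChain's implementation, and `run_mapping` and `fake_retriever` are hypothetical stand-ins.

```python
from operator import itemgetter

def run_mapping(steps, inputs):
    """Call every value in `steps` with the same input and collect the results by key."""
    return {key: step(inputs) for key, step in steps.items()}

def fake_retriever(query):
    # Hypothetical stand-in for a real retriever.
    return [f"passage about {query}"]

steps = {
    # Pick the query out of the input, then hand it to the retriever.
    "context": lambda inputs: fake_retriever(itemgetter("user_query")(inputs)),
    # itemgetter returns a callable that picks one key from the input dict.
    "user_query": itemgetter("user_query"),
    "topic": itemgetter("topic"),
}
result = run_mapping(steps, {"user_query": "login", "topic": "OWASP"})
print(result)
```

The resulting dictionary is what the prompt template receives, which is why its keys have to match the placeholder names.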

Since we are in a Jupyter notebook, IPython.display renders the Markdown output much better than a plain print.

In [None]:
from IPython.display import display, Markdown

display(Markdown(completions))