In [17]:
#pip install minsearch

In [16]:
#!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

## Quick intro to RAG

This is an intro to RAG (retrieval-augmented generation) and to search. We will use minsearch, a mini search engine built in a previous Zoomcamp workshop, as the retrieval part of our solution.

### About Minsearch
A minimalistic text search engine that uses TF-IDF and cosine similarity for text fields and exact matching for keyword fields. The library provides two implementations:

- Index: A basic search index using scikit-learn's TF-IDF vectorizer
- AppendableIndex: An appendable search index using an inverted index implementation that allows for incremental document addition

To install, use `pip install minsearch`

You can view full details of the library [here](https://github.com/alexeygrigorev/minsearch)
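
To make "TF-IDF and cosine similarity" concrete, here is a minimal pure-Python sketch of the idea. It is not minsearch's actual code: the helper names (`tokenize`, `fit_idf`, `tfidf_vector`, `cosine`), the whitespace tokenizer, the smoothed IDF formula and the three-line corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; a real vectorizer does more than this
    return text.lower().split()


def fit_idf(corpus):
    # "Fitting": learn a smoothed inverse document frequency for every term
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in doc_freq.items()
    }


def tfidf_vector(text, idf):
    # Term frequency times IDF; terms not seen during fitting are ignored
    term_freq = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in term_freq.items() if term in idf}


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini corpus, for illustration only
corpus = [
    "when will the course start",
    "can i still join the course after the start date",
    "how do i install docker",
]
idf = fit_idf(corpus)
doc_vectors = [tfidf_vector(text, idf) for text in corpus]

query_vector = tfidf_vector("can i join the course", idf)
scores = [cosine(query_vector, vec) for vec in doc_vectors]
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

The document sharing the most distinctive terms with the query scores highest; common words such as "the" contribute less because their IDF is lower.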

In [22]:
# minsearch was built in a previous Zoomcamp workshop; see the link above
from minsearch import Index
import json

### The docs
The document used below has already been converted to JSON for the best outcome. For your own documents, you can write a parser that converts them to JSON before continuing with the steps below. A typical flow is to fetch the document with requests, load it as JSON or into a pandas DataFrame, and then convert it to a dictionary for final processing.

<strong>To dataframe and dict()</strong>
- pandas readers for CSV, TSV, text, Excel and JSON files
- python-docx for DOCX files (for example, files exported from Google Docs)
- Then `.to_dict()` to convert to a dictionary
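
As a rough illustration of the "convert to dictionaries" step, the sketch below turns a small CSV into a list of dicts, which is the shape the index expects. It uses the standard library's `csv.DictReader` instead of pandas so that it runs without extra installs; with pandas the equivalent would be `pd.read_csv(...).to_dict(orient="records")`. The CSV content and the `demo-course` name are made up for the example.

```python
import csv
import io

# Hypothetical FAQ data; in practice this would come from a file or a download
raw_csv = io.StringIO(
    "course,section,question,text\n"
    "demo-course,General,When does it start?,It starts in January.\n"
    "demo-course,General,Is it free?,Yes.\n"
)

# Each row becomes one dict keyed by the header names
records = list(csv.DictReader(raw_csv))
print(records[0])
```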

In [19]:
with open('./documents.json', 'rt') as f_in:
    document = json.load(f_in)

docs = []

for course in document:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        docs.append(doc)

In [20]:
docs[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

### Create and fit the index.
- Remember fitting from training an ML model with scikit-learn? Yes, that fit.
- `fit` takes training data as input and learns the necessary parameters or patterns from this data.
- Now we initialise the `Index` class from minsearch and feed the `docs` above (already parsed as a list) into the index.

### Boosting and filtering
Boosting adds weight to the fields being searched. The documents above have three text fields (text, section, question) and one keyword field (course). If we want to prioritise text and question, we use the boost parameter to give them a higher weight.

This is similar to defining an Algolia or Elasticsearch index.

Filter helps you to restrict your search responses to a particular set of records.

Example: `filter_dict = {"course": "data-engineering-zoomcamp"}`
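
The effect of boosting and filtering can be sketched in a few lines of plain Python. This only illustrates the idea and is not how minsearch is implemented internally: `combine_scores`, `matches_filter`, the `candidates` list and its per-field scores are all hypothetical.

```python
def combine_scores(field_scores, boost_dict):
    # Weighted sum of per-field similarity scores; unboosted fields count once
    return sum(
        score * boost_dict.get(field, 1.0) for field, score in field_scores.items()
    )


def matches_filter(doc, filter_dict):
    # Keyword fields must match exactly for a document to be kept
    return all(doc.get(field) == value for field, value in filter_dict.items())


# Made-up candidates with made-up per-field similarity scores
candidates = [
    {"id": "a", "course": "data-engineering-zoomcamp",
     "field_scores": {"question": 0.2, "text": 0.6}},
    {"id": "b", "course": "data-engineering-zoomcamp",
     "field_scores": {"question": 0.5, "text": 0.2}},
    {"id": "c", "course": "other-course",
     "field_scores": {"question": 0.9, "text": 0.9}},
]

boost_dict = {"question": 5, "text": 3, "section": 1}
filter_dict = {"course": "data-engineering-zoomcamp"}

# Filter first, then rank the remaining documents by boosted score
kept = [doc for doc in candidates if matches_filter(doc, filter_dict)]
ranked = sorted(
    kept,
    key=lambda doc: combine_scores(doc["field_scores"], boost_dict),
    reverse=True,
)
print([doc["id"] for doc in ranked])
```

Document `c` has the highest raw scores but is dropped by the filter, and `b` outranks `a` only because its match is in the more heavily boosted `question` field.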

In [23]:
index = Index(
    text_fields=["text", "question", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

<minsearch.minsearch.Index at 0x774cf874bc80>

In [33]:
boost_dict = {"question": 5, "text": 3, "section": 1}
filter_dict = {"course": "data-engineering-zoomcamp"}

In [42]:
# define the query before running the search
query = "Can I join the course if it has already started?"

In [43]:
# performing an actual search
results = index.search(
    query=query,
    boost_dict=boost_dict,
    filter_dict=filter_dict,
    num_results=5
)

In [None]:
for result in results:
    print(json.dumps(result, indent=2))

## Generating answers
A quick recap of where we are and the journey ahead.

- RAG: Built around a search engine over a corpus of data, for example FAQ documents or any other documents provided. Structured data gives the best outcome.
- LLM: When a user sends a query, it first hits the search engine, which is based on your data. The search usually returns several results; these results are then sent to an LLM, which generates an answer from them.

Example: When you search for something on Google, there is an AI summary at the top. It is essentially a summary of the links you are about to scroll through on the first page of results.

We will be using Gemini instead of the OpenAI model used during the class.

In [45]:
# imports
from google import genai

In [46]:
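# initialise gemini client; read the API key from an environment variable
# (here assumed to be GEMINI_API_KEY) instead of hardcoding it in the notebook
import os

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])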

In [55]:
# Create a prompt template to guide the LLM
template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NOTHING FOUND

QUESTION: {question}

CONTEXT: 
{context}
"""

In [63]:
# Build the context from the results of the search query executed above.
# That is, we searched the documents and got several results; together they form the context for the LLM.
# For generic cases where we don't know the structure of people's documents, we would either
# create templates to guide them in setting up their instance or create the context format for them.

context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [57]:
prompt = template.format(question=query, context=context).strip()

In [58]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NOTHING FOUND

QUESTION: Can I join the course if it has already started?

CONTEXT: 
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final cap

In [62]:
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt
)

print(response.text)

Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.



## Converting it into a function

In [64]:
# the search function
def search(query):
    boost_dict = {"question": 5, "text": 3, "section": 1}
    filter_dict = {"course": "data-engineering-zoomcamp"}

    results = index.search(
        query=query,
        boost_dict=boost_dict,
        filter_dict=filter_dict,
        num_results=5
    )

    return results

In [66]:
def build_prompt(query, search_results):
    template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
    Use only the facts from the CONTEXT when answering the QUESTION.
    If the CONTEXT doesn't contain the answer, output NOTHING FOUND
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """
    
    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = template.format(question=query, context=context).strip()

    return prompt

In [75]:
def llm_response(prompt):
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )

    return response.text

In [78]:
def rag(query):
    # search -> build the prompt from the results -> ask the LLM
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm_response(prompt)

    return answer

In [82]:
query = "how long is the course?"

# run the full pipeline and keep the answer
answer = rag(query)

In [83]:
answer

'NOTHING FOUND\n'