In [17]:
#pip install minsearch

In [16]:
#!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

## Quick intro to RAG

This is an intro to RAG and to search. We will use minsearch, a mini search engine built in a previous Zoomcamp workshop, as the retrieval component of our solution.

### About Minsearch
A minimalistic text search engine that uses TF-IDF and cosine similarity for text fields and exact matching for keyword fields. The library provides two implementations:

- Index: A basic search index using scikit-learn's TF-IDF vectorizer
- AppendableIndex: An appendable search index using an inverted index implementation that allows for incremental document addition

To install, use `pip install minsearch`

You can view full details of the library [here](https://github.com/alexeygrigorev/minsearch)
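
To make the inverted-index idea behind `AppendableIndex` more concrete, here is a minimal sketch in plain Python. It is not minsearch's actual implementation: `TinyInvertedIndex`, `append` and `lookup` are hypothetical names, and it only does exact token matching without any TF-IDF scoring.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy inverted index: maps each token to the ids of documents containing it."""

    def __init__(self):
        self.docs = []                    # documents in insertion order
        self.postings = defaultdict(set)  # token -> set of document ids

    def append(self, text):
        """Add one document without rebuilding the rest of the index."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for token in text.lower().split():
            self.postings[token].add(doc_id)
        return doc_id

    def lookup(self, query):
        """Return the sorted ids of documents that contain every query token."""
        tokens = query.lower().split()
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(matches)


index = TinyInvertedIndex()
index.append("how do I join the course")
index.append("course start date")
index.append("join the slack channel")

print(index.lookup("join"))         # [0, 2]
print(index.lookup("join course"))  # [0]
```

Because each new document only touches the postings of its own tokens, documents can be added one at a time, which is the property the appendable variant is described as offering.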

In [4]:
# minsearch was built in a previous Zoomcamp workshop; see the link above
from minsearch import Index
import json

### The docs
The documents used below have already been converted to JSON, which gives the best results. For your own data, you can write a parser that converts documents to JSON and then continue with the steps below. A typical flow is to fetch the document with `requests`, load it as JSON or a pandas DataFrame, and convert it to a dictionary for final processing.

<strong>To DataFrame and dict()</strong>
- pandas readers for CSV, TSV, text, Excel and JSON
- python-docx for DOCX files (for example, documents exported from Google Docs)
- Then `.to_dict()` to convert the DataFrame to a dictionary
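
As an illustration of the target shape, the sketch below turns CSV text into a list of dictionaries using only the standard library `csv` module instead of pandas. The sample data and the `csv_to_records` helper are made up for this example. With pandas, `pd.read_csv(...).to_dict(orient="records")` produces the same list-of-dicts shape.

```python
import csv
import io

# Hypothetical FAQ export; in practice this would come from a file or an HTTP response.
raw_csv = """section,question,text
General,When will the course start?,The course starts on 15th Jan 2024.
General,Can I join late?,"Yes, you can still submit the homeworks."
"""


def csv_to_records(csv_text, course_name):
    """Parse CSV text into a list of dicts and tag each row with its course."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["course"] = course_name
        records.append(dict(row))
    return records


records = csv_to_records(raw_csv, "data-engineering-zoomcamp")
print(records[0]["question"])  # When will the course start?
```

Each record carries the same `text`, `section`, `question` and `course` keys as the parsed FAQ documents, so it can be indexed in the same way.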

In [20]:
with open('./documents.json', 'rt') as f_in:
    document = json.load(f_in)

docs = []

for course in document:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        docs.append(doc)

In [6]:
docs[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

### Create and fit the index
- This is the same `fit` you know from training an ML model with scikit-learn.
- `fit` takes the training data as input and learns the necessary parameters or patterns from it.
- Here we initialise the `Index` class from minsearch and feed it the `docs` list parsed above.
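
To show what "learning parameters" can mean for a text index, here is a small sketch of a TF-IDF model written in plain Python. It is an illustration under simplifying assumptions (whitespace tokenisation, one text field), not the code minsearch runs. `TinyTfidf` is a hypothetical name, and the smoothed IDF formula follows the one scikit-learn documents for its TF-IDF vectorizer.

```python
import math
from collections import Counter


class TinyTfidf:
    """Toy TF-IDF model: fit() learns IDF weights, search() ranks by cosine similarity."""

    def fit(self, texts):
        self.texts = list(texts)
        tokenized = [t.lower().split() for t in self.texts]
        n_docs = len(tokenized)
        # Document frequency: in how many documents each token appears.
        df = Counter(token for tokens in tokenized for token in set(tokens))
        # Smoothed IDF, so rarer tokens get larger weights.
        self.idf = {tok: math.log((1 + n_docs) / (1 + n)) + 1 for tok, n in df.items()}
        self.vectors = [self._vectorize(tokens) for tokens in tokenized]
        return self

    def _vectorize(self, tokens):
        # Tokens never seen during fit() have no IDF weight and are ignored.
        counts = Counter(tok for tok in tokens if tok in self.idf)
        return {tok: c * self.idf[tok] for tok, c in counts.items()}

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def search(self, query, num_results=3):
        q = self._vectorize(query.lower().split())
        scored = [(self._cosine(q, v), text) for v, text in zip(self.vectors, self.texts)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:num_results] if score > 0]


model = TinyTfidf().fit([
    "when will the course start",
    "can i join the course after the start date",
    "how do i install docker",
])
print(model.search("install docker"))  # ['how do i install docker']
```

The state learned in `fit` (the vocabulary and its IDF weights) is what later lets `search` turn a query into a comparable vector.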

### Boosting and filtering
Boosting adds weight to individual search fields. The documents above have three text fields (`text`, `section`, `question`) plus a `course` keyword field. To prioritise `text` and `question`, set their weights with the boost parameter.

This is similar to defining a search index in Algolia or Elasticsearch.

Filter helps you to restrict your search responses to a particular set of records.

Example: `filter_dict = {"course": "data-engineering-zoomcamp"}`
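
The sketch below shows one way boosting and filtering can combine: each field's relevance score is multiplied by its boost and summed, and documents that fail the keyword filter are dropped. The per-field scores are made-up numbers, and `combine_scores` and `rank` are hypothetical helpers rather than part of minsearch.

```python
def combine_scores(field_scores, boost_dict):
    """Weighted sum of per-field relevance scores; unlisted fields get weight 1."""
    return sum(score * boost_dict.get(field, 1.0) for field, score in field_scores.items())


def rank(candidates, boost_dict, filter_dict, num_results=5):
    """Drop candidates that fail the keyword filter, then sort by boosted score."""
    kept = [
        c for c in candidates
        if all(c["doc"].get(key) == value for key, value in filter_dict.items())
    ]
    kept.sort(key=lambda c: combine_scores(c["scores"], boost_dict), reverse=True)
    return [c["doc"] for c in kept[:num_results]]


# Made-up per-field similarity scores for three documents.
candidates = [
    {"doc": {"id": "a", "course": "data-engineering-zoomcamp"},
     "scores": {"question": 0.2, "text": 0.9, "section": 0.0}},
    {"doc": {"id": "b", "course": "data-engineering-zoomcamp"},
     "scores": {"question": 0.8, "text": 0.1, "section": 0.0}},
    {"doc": {"id": "c", "course": "mlops-zoomcamp"},
     "scores": {"question": 1.0, "text": 1.0, "section": 1.0}},
]

boost_dict = {"question": 5, "text": 3, "section": 1}
filter_dict = {"course": "data-engineering-zoomcamp"}

top = rank(candidates, boost_dict, filter_dict)
print([doc["id"] for doc in top])  # ['b', 'a']
```

Without boosts, document `a` would rank first because of its strong `text` match. Weighting `question` by 5 moves `b` ahead, and `c` never appears because it belongs to another course.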

In [23]:
index = Index(
    text_fields=["text", "question", "section"],
    keyword_fields=["course"]
)
index.fit(docs)

<minsearch.minsearch.Index at 0x774cf874bc80>

In [33]:
boost_dict = {"question": 5, "text": 3, "section": 1}
filter_dict = {"course": "data-engineering-zoomcamp"}

In [43]:
# define the query first so it exists before the search runs
query = "Can I join the course if it has already started?"

In [42]:
# performing an actual search
results = index.search(
    query=query,
    boost_dict=boost_dict,
    filter_dict=filter_dict,
    num_results=5
)

In [None]:
for result in results:
    print(json.dumps(result, indent=2))

## Generating answers
A quick recap of the journey so far and what lies ahead.

- RAG: The retrieval side is typically a search engine over a corpus of data, for example FAQ documents or any other document you provide. Structured data gives the best results.
- LLM: When a user sends a query, it first hits the search engine built on your data. The search usually returns several results, which are then sent to an LLM to generate a single answer that summarises them.

Example: When you search for something on Google, there is an AI summary at the top. This is essentially a summary of the links you are about to scroll through on the first page of results.

We will be using Gemini instead of the OpenAI API used during the class.

In [14]:
# imports
from google import genai

In [31]:
# initialise gemini client
client = genai.Client(api_key="AIzaSyCWYdzvCfj_...-v5z5A_1CkE7vc")

In [55]:
# Create a prompt template to guide the LLM
template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NOTHING FOUND

QUESTION: {question}

CONTEXT: 
{context}
"""

In [63]:
# Build the context from the results of the search query executed above.
# We searched the documents and got several results; together they form the context for the LLM.
# For generic cases where we don't know the structure of users' documents, we would either
# provide templates to guide them in setting up their instance or create the templates for them.

context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [57]:
prompt = template.format(question=query, context=context).strip()

In [58]:
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NOTHING FOUND

QUESTION: Can I join the course if it has already started?

CONTEXT: 
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final cap

In [62]:
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt
)

print(response.text)

Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.



## Converting it into a function

In [19]:
#importing the libraries
from minsearch import Index
import json
import requests
from google import genai

In [2]:
url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
response = requests.get(url)
document = response.json()

In [3]:
#with open('./documents.json', 'rt') as f_in:
#    document = json.load(f_in)

docs = []

for course in document:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        docs.append(doc)

In [30]:
gemini_key = "AIzaSyCWYdzvCfj_...-v5z5A_1CkE7vc"

In [32]:
# the search function
def search(query):
    index = Index(
        text_fields=["text", "question", "section"],
        keyword_fields=["course"]
    )
    index.fit(docs)

    boost_dict = {"question": 5, "text": 3, "section": 1}
    filter_dict = {"course": "data-engineering-zoomcamp"}

    results = index.search(
        query=query,
        boost_dict=boost_dict,
        filter_dict=filter_dict,
        num_results=5
    )

    return results

In [33]:
def build_prompt(query, search_results):
    template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
    Use only the facts from the CONTEXT when answering the QUESTION.
    If the CONTEXT doesn't contain the answer, output NOTHING FOUND
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """
    
    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = template.format(question=query, context=context).strip()

    return prompt

In [34]:
def llm_response(prompt):
    client = genai.Client(api_key=gemini_key)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )

    # return the text itself; print() returns None, so `return print(...)` would hand back None
    return response.text

In [35]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm_response(prompt)

    return answer

In [36]:
rag("can i register for the course?")

Register before the course starts using this link.
Yes, even if you don't register, you're still eligible to submit the homeworks.



## Running search using Elasticsearch

- In a terminal, run Elasticsearch in Docker:
  ```bash
  docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
  ```
- If Docker started without errors but port 9200 is not forwarded, forward the port manually in your terminal or editor.
- Check that the port is actually working by opening a new terminal and running `curl http://localhost:9200`.
- See the data engineering course for more detail on Docker.
- Also read more about Elasticsearch [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/elastic-search.md)
- More resources for module one are [here](https://github.com/DataTalksClub/llm-zoomcamp/tree/main/01-intro), including the RAG GUI and other people's notes
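
The same reachability check can be done from Python before creating the client. This is a minimal sketch using only the standard library. `elasticsearch_is_up` is a hypothetical helper, and it assumes the server answers the root URL with a JSON body, as the `curl` check above suggests.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_is_up(base_url="http://localhost:9200", timeout=2.0):
    """Return True if the URL answers with a JSON body, False if it cannot be reached."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            json.loads(resp.read().decode("utf-8"))
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or a non-JSON response.
        return False


print(elasticsearch_is_up())
```

If this prints `False`, the container is probably not running or the port is not forwarded, and indexing calls from the notebook would fail for the same reason.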

In [1]:
# import the libraries
from minsearch import Index
import json
import requests
from google import genai
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from tqdm.auto import tqdm

In [2]:
url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
response = requests.get(url)
document = response.json()

In [3]:
docs = []

for course in document:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        docs.append(doc)

In [29]:
gemini_key = "AIzaSyCWYdzvCfj_...-v5z5A_1CkE7vc"

In [8]:
# create es client
es_client = Elasticsearch('http://localhost:9200')

In [9]:
#run the index settings
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "dynamic": "strict",  # Prevent unwanted field additions
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

#give the index a name
index_name = "course-questions"

# Delete old index if exists
#if es_client.indices.exists(index=index_name):
#    es_client.indices.delete(index=index_name)
    
#create the index
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [14]:
'''
success, errors = bulk(
    es_client,
    [{"_index": "course-questions", "_source": doc} for doc in docs]
)
print(f"Indexed {success} documents")
if errors:
    print("Errors:", errors)
'''

Indexed 948 documents


In [10]:
# iterate through the documents to add them to ES
# this failed initially because I was passing the entire docs list instead of one doc at a time
for doc in tqdm(docs):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

In [21]:
#check index exists
#print(es_client.indices.exists(index="course-questions"))

In [22]:
#show the mapping available
#print(es_client.indices.get_mapping(index="course-questions"))

In [12]:
query = "how do i enrol for the course?"

In [17]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es_client.search(index=index_name, body=search_query)
    
    result_doc = []
    for hit in response['hits']['hits']:
        result_doc.append(hit['_source'])

    return result_doc

In [28]:
#elastic_search(query)
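
The `"question^3"` entry in the query above is Elasticsearch's way of boosting a field inside `multi_match`. As a sketch of how the `boost_dict` and `filter_dict` style used with minsearch could be mapped onto that query, the helpers below build the same request body from the two dictionaries. `to_es_fields` and `build_search_query` are hypothetical names introduced for illustration.

```python
def to_es_fields(boost_dict):
    """Turn {"question": 3, "text": 1} into ["question^3", "text"]."""
    fields = []
    for field, weight in boost_dict.items():
        # A weight of 1 is the default boost, so the suffix can be dropped.
        fields.append(field if weight == 1 else f"{field}^{weight}")
    return fields


def build_search_query(query, boost_dict, filter_dict, size=5):
    """Assemble a bool query: boosted multi_match plus one term filter per key."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": to_es_fields(boost_dict),
                        "type": "best_fields",
                    }
                },
                "filter": [{"term": {k: v}} for k, v in filter_dict.items()],
            }
        },
    }


print(to_es_fields({"question": 3, "text": 1, "section": 1}))
# ['question^3', 'text', 'section']
```

The resulting dictionary can be passed to `es_client.search` in place of the hand-written `search_query`, which keeps the boosts and filters in one place when experimenting with different weights.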

In [21]:
def build_prompt(query, search_results):
    template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT provided.
    Use only the facts from the CONTEXT when answering the QUESTION.
    If the CONTEXT doesn't contain the answer, output NOTHING FOUND
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """
    
    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = template.format(question=query, context=context).strip()

    return prompt

In [24]:
def llm_response(prompt):
    client = genai.Client(api_key=gemini_key)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )

    # return the text itself; print() returns None, so `return print(...)` would hand back None
    return response.text

In [22]:
def es_rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm_response(prompt)

    return answer

In [27]:
es_rag("do i need git and github for this and how do i do that")

Yes, you will probably need git and github for this course. After you create a GitHub account, you should clone the course repo to your local machine using the process outlined in this video: Git for Everybody: How to Clone a Repository from GitHub
Having this local repository on your computer will make it easy for you to access the instructors’ code and make pull requests (if you want to add your own notes or make changes to the course content).
You will probably also create your own repositories that host your notes, versions of your file, to do this. Here is a great tutorial that shows you how to do this: https://www.atlassian.com/git/tutorials/setting-up-a-repository
Remember to ignore large database, .csv, and .gz files, and other files that should not be saved to a repository. Use .gitignore for this: https://www.atlassian.com/git/tutorials/saving-changes/gitignore NEVER store passwords or keys in a git repo (even if that repo is set to private).
This is also a great resource: ht