## Introduction to LLM and RAG

In [1]:
#!pip install minsearch

In [2]:
import minsearch

In [3]:
import json

In [4]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [5]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [6]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [7]:
index = minsearch.Index(
    text_fields = ["question", "text", "section"],
    keyword_fields = ["course"]
)

In [8]:
# Idea of index
# SELECT  * WHERE course = 'data-engineering-zoomcamp';

In [9]:
q = 'the course has already started, can I still enroll?'

In [10]:
index.fit(documents)

<minsearch.minsearch.Index at 0x76d5b846a510>

In [11]:
boost = {'question': 3.0, 'section' : 0.5}

index.search(
    query = q,
    boost_dict = boost,
    num_results = 6
)

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the cour

In [12]:
# filtered_results
results = index.search(
    query = q,
    filter_dict = {'course': 'data-engineering-zoomcamp'},
    boost_dict = boost,
    num_results = 6
)

In [96]:
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

In [14]:
from openai import OpenAI

In [15]:
client = OpenAI()

In [16]:
response = client.chat.completions.create(
                model = 'gpt-4o',
                messages = [{"role": "user", "content":q}]
                 )

In [17]:
response

ChatCompletion(id='chatcmpl-BjApcMATY2p6YbUYTEakHPX32bNA6', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Whether you can still enroll in a course that has already started depends on the policies of the institution offering the course. Here are a few steps you can take:\n\n1. **Check the Course or Institution's Policy**: Some institutions allow late enrollment within a certain timeframe. Look for information on their website or in the course details.\n\n2. **Contact the Instructor or Administration**: Reach out to the instructor or the administrative office of the institution. They may be able to make an exception or provide guidance on how to catch up.\n\n3. **Consider Online Courses**: If it's an online course, there might be more flexibility. Some online platforms allow you to enroll at any time and work at your own pace.\n\n4. **Assess Feasibility**: Even if you are allowed to enroll, consider whether you can realistically catc

In [18]:
response.choices[0].message.content

"Whether you can still enroll in a course that has already started depends on the policies of the institution offering the course. Here are a few steps you can take:\n\n1. **Check the Course or Institution's Policy**: Some institutions allow late enrollment within a certain timeframe. Look for information on their website or in the course details.\n\n2. **Contact the Instructor or Administration**: Reach out to the instructor or the administrative office of the institution. They may be able to make an exception or provide guidance on how to catch up.\n\n3. **Consider Online Courses**: If it's an online course, there might be more flexibility. Some online platforms allow you to enroll at any time and work at your own pace.\n\n4. **Assess Feasibility**: Even if you are allowed to enroll, consider whether you can realistically catch up on the material and meet the course requirements.\n\nEach institution has its own rules, so it's essential to communicate directly with them to understand 

In [97]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE

QUESTION: {question}

CONTEXT:
{context}
""".strip()

In [98]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [21]:
print(context)

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - When will the course start?
answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start wit

In [22]:
prompt = prompt_template.format(question=q, context = context).strip()

In [23]:
response = client.chat.completions.create(
                model = 'gpt-4o',
                messages = [{"role": "user", "content":prompt}]
                 )

response.choices[0].message.content

"Yes, even if you don't register, you're still eligible to submit the homework. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute."

### 1.5 - The RAG Flow: Cleaning and Modularizing Code

In [24]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query = query,
        filter_dict = {'course': 'data-engineering-zoomcamp'},
        boost_dict = boost,
        num_results = 5
    )

    return results

- The `search` function queries the `minsearch` index built above and returns the documents most relevant to the query.
- The `query` argument is the search string or phrase entered by the user.
- The `boost` dictionary sets the relative weight of the text fields in the search: matches in `question` are weighted 3.0 and matches in `section` 0.5.
- The `filter_dict` argument restricts the results to documents whose `course` field equals `'data-engineering-zoomcamp'`. The function returns a list of dictionaries, one per matching document.
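To make the roles of `filter_dict` and `boost_dict` concrete, here is a minimal pure-Python sketch. It is not how `minsearch` scores documents: `toy_search` is a hypothetical helper that simply counts overlapping words, and the sample documents are made up for illustration.

```python
from collections import Counter


def toy_search(docs, query, filter_dict=None, boost_dict=None, num_results=5,
               text_fields=("question", "text", "section")):
    """Toy stand-in for a filtered, boosted text search (word overlap only)."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_terms = set(query.lower().split())

    scored = []
    for doc in docs:
        # Keyword filter: exact match on every filtered field, like a SQL WHERE clause.
        if any(doc.get(field) != value for field, value in filter_dict.items()):
            continue
        score = 0.0
        for field in text_fields:
            counts = Counter(doc.get(field, "").lower().split())
            overlap = sum(counts[term] for term in query_terms)
            # Fields without an explicit boost get a weight of 1.0 (an assumption of this sketch).
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Made-up sample documents, for illustration only.
sample_docs = [
    {"question": "Can I still join the course?", "text": "Yes, you can.",
     "section": "General", "course": "data-engineering-zoomcamp"},
    {"question": "How do I install Docker?", "text": "You can still join the course Slack for help.",
     "section": "Setup", "course": "data-engineering-zoomcamp"},
    {"question": "Can I still join the course?", "text": "Yes.",
     "section": "General", "course": "machine-learning-zoomcamp"},
]

hits = toy_search(
    sample_docs,
    "can I still join",
    filter_dict={"course": "data-engineering-zoomcamp"},
    boost_dict={"question": 3.0, "section": 0.5},
)
for doc in hits:
    print(doc["question"])
```

The document from the other course is dropped by the filter before any scoring happens, and the document whose `question` matches the query outranks the one that only matches in `text`.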

In [71]:
def build_prompt(query, search_results):
    # The template text starts at column 0 so that no source indentation leaks into the prompt.
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    # Append one formatted entry per retrieved document.
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

- The `build_prompt` function constructs the prompt that the language model receives: it embeds the user's question and the context extracted from the search results into a fixed template.
- The template instructs the LLM to answer using only the provided context.
- `search_results` is the list of retrieved documents relevant to the query (e.g., the output of `search(query)`).
- To build the context, the function starts with an empty string and iterates over `search_results`, appending a formatted entry with the `section`, `question` and answer (the `text` field) of each document.
- Finally, `prompt_template.format()` fills in the question and the context.
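The context-building loop can also be written with `str.join`. The sketch below is an alternative formulation, not the notebook's code: `format_context` is a hypothetical helper and the two documents are made up. Unlike the loop in `build_prompt`, it leaves no trailing blank lines, so nothing depends on a final `.strip()`.

```python
def format_context(search_results):
    """Format each retrieved document as one entry and separate entries with a blank line."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    return "\n\n".join(entries)


# Made-up documents with the same fields as the FAQ entries.
example_results = [
    {"section": "General", "question": "Can I join late?", "text": "Yes."},
    {"section": "Setup", "question": "Which Python version?", "text": "3.10 or newer."},
]

context = format_context(example_results)
print(context)
```

An empty result list produces an empty context string, which is worth keeping in mind when the search returns nothing.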

In [72]:

def llm(prompt):
    response = client.chat.completions.create(
                model = 'gpt-4o',
                messages = [{"role": "user", "content":prompt}]
                 )

    return response.choices[0].message.content

- The `llm` function sends a prompt to a Large Language Model (here OpenAI's GPT-4o) and returns the generated response.
- The LLM reads the prompt, processes the context, and generates a response. The response is expected to be grounded in the provided context, not invented from scratch.
- `client` is an instance of the OpenAI API client from the `openai` Python library.
- `.chat.completions.create` is the method used to interact with OpenAI’s chat-based models (like GPT-4o).
- `response` is the object returned by the API call. It contains a list of possible completions (`choices`); the function returns the text of the first one.
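The path `response.choices[0].message.content` can be explored without an API key by using a stand-in client. The sketch below is a hypothetical test double, not part of the `openai` library: `FakeChatClient` and `llm_with` are names invented here, and the fake reproduces only the attributes that the `llm` function reads.

```python
from types import SimpleNamespace


class FakeChatClient:
    """Hypothetical offline stand-in exposing client.chat.completions.create."""

    def __init__(self, canned_answer):
        self._canned_answer = canned_answer
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages):
        # Build only the parts of the response that the notebook accesses.
        message = SimpleNamespace(role="assistant", content=self._canned_answer)
        choice = SimpleNamespace(index=0, finish_reason="stop", message=message)
        return SimpleNamespace(choices=[choice])


def llm_with(client, prompt, model="gpt-4o"):
    """Same logic as llm(), but with the client passed in explicitly."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


fake = FakeChatClient("Yes, you can still join.")
answer = llm_with(fake, "Can I still join?")
print(answer)
```

Passing the client as an argument is a design choice of this sketch; it lets the rest of the pipeline be exercised without network calls.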

### RAG workflow

In [27]:
query = 'how do I run kafka?'

To summarize, answering a user’s question takes three steps:
- Retrieve relevant information from a database or knowledge base.
- Build a context-rich prompt.
- Pass that prompt to a language model (LLM) for answer generation.

In [28]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [29]:
rag(query)

'To run Kafka, if you are working with Java, in your project directory, you can execute the Java Kafka producer, consumer, or kstreams by using the following command in the terminal:\n\n```bash\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n\nFor running Python-based Kafka applications, ensure that you set up a virtual environment and install the necessary dependencies using `pip install -r ../requirements.txt`.'

In [30]:
rag('the course has already started, can I still enroll?')

"Yes, you can still enroll in the course after it has started. You are eligible to submit the homeworks, but keep in mind that there will be deadlines for turning in the final projects, so it's important not to leave everything until the last minute."

### 1.6 - Search with Elasticsearch

We want to transition from a toy search engine to Elasticsearch for better search results. Elasticsearch is a versatile platform that enables fast, scalable, and flexible search and analytics across a wide range of data and use cases.


In [156]:
# Recall
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [157]:
from elasticsearch import Elasticsearch

In [158]:
es_client = Elasticsearch('http://127.0.0.1:9200')

Defining the index `settings` and `mappings` up front ensures our data is stored and searched efficiently and accurately:
- Shards and replicas affect performance, scalability, and reliability.
- Mappings define the structure of our documents, i.e., what fields they have and what data types those fields are.
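The mapping used in this section declares `course` as a `keyword` field and the other fields as `text`. As a rough illustration of the difference, the sketch below mimics the two matching behaviours in plain Python. The `analyze` function is a crude stand-in assumed for this example (lowercasing and splitting on non-alphanumeric characters); real Elasticsearch analyzers are more sophisticated, and all three function names are hypothetical.

```python
import re


def analyze(value):
    """Crude stand-in for a text analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", value.lower()) if token]


def text_field_matches(field_value, query):
    """A 'text' field matches when the analyzed field and query share a token."""
    return bool(set(analyze(field_value)) & set(analyze(query)))


def keyword_field_matches(field_value, term):
    """A 'keyword' field matches only on the exact, unanalyzed value."""
    return field_value == term


question = "Course - Can I still join the course after the start date?"
course = "data-engineering-zoomcamp"

print(text_field_matches(question, "JOIN"))               # token-level match
print(keyword_field_matches(course, course))              # exact value
print(keyword_field_matches(course, "data-engineering"))  # partial value does not match
print(text_field_matches(course, "engineering"))          # what a text mapping would allow
```

The last line hints at why `course` is better kept as a `keyword`: treated as text, a course name could match on a fragment, which is not the intent when the field is used as a filter.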

In [159]:
es_client.info()

ObjectApiResponse({'name': '9eb7322ebe21', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'pu_30jWiQh-awbWUMlJbMQ', 'version': {'number': '9.0.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '0a58bc1dc7a4ae5412db66624aab968370bd44ce', 'build_date': '2025-05-28T10:06:37.834829258Z', 'build_snapshot': False, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.18.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'})

In [160]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

`index_name` is the name of the Elasticsearch index where our documents (course questions, in this case) will be stored.

The existence check prevents the error raised when trying to create an index that is already present.

In [161]:
index_name = "course-questions"

if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_settings)
else:
    print(f"Index '{index_name}' already exists.")

Index 'course-questions' already exists.


In [162]:
from tqdm.auto import tqdm

In [163]:
for doc in tqdm(documents):
    es_client.index(index = index_name, document = doc)

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 948/948 [00:02<00:00, 415.04it/s]
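Calling `es_client.index` without an `id` lets Elasticsearch generate a new ID on every call, so re-running the loop against an index that already exists stores each document again. This may be why the same FAQ entry appears more than once in some of the search results in this notebook. One possible remedy, sketched here as an assumption rather than the course's method, is to derive a deterministic ID from the document content. `doc_id` is a hypothetical helper.

```python
import hashlib
import json


def doc_id(doc):
    """Derive a stable ID from the document content (keys sorted for determinism)."""
    payload = json.dumps(doc, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Made-up document with the same fields as the FAQ entries.
example_doc = {
    "course": "data-engineering-zoomcamp",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "text": "See the course calendar.",
}

print(doc_id(example_doc))

# In the notebook, the indexing loop could then pass the ID explicitly, so that
# a second run overwrites each document instead of adding a copy:
#
# for doc in tqdm(documents):
#     es_client.index(index=index_name, id=doc_id(doc), document=doc)
```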


In [164]:
query = 'I just discovered the course. Can I still join?'

In [165]:
def elastic_search(query):

    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"], #question is 3 times more important than text and section
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index = index_name, body = search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

The `elastic_search` function is designed to query an Elasticsearch index for the most relevant documents matching a user’s query, with custom weighting and filtering.
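The request body is a plain dictionary, so the parts that vary (course filter, result count, question boost) can be turned into parameters. The following is a sketch of that idea, not a replacement prescribed by the course: `build_search_query` and its parameter names are assumptions made for this example.

```python
def build_search_query(query, course=None, size=5, question_boost=3):
    """Build the same bool/multi_match body as above, with optional course filter."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                # "question^3" means matches in question are weighted 3x.
                "fields": [f"question^{question_boost}", "text", "section"],
                "type": "best_fields",
            }
        }
    }
    # Only add the filter clause when a course is given.
    if course is not None:
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_query}}


filtered = build_search_query(
    "I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp",
)
unfiltered = build_search_query("How do I run kafka?", size=3, question_boost=4)

print(filtered["query"]["bool"]["filter"])
print(unfiltered["query"]["bool"]["must"]["multi_match"]["fields"])
```

The resulting dictionary can be passed to `es_client.search` in the same way as `search_query` in the function above.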


In [166]:
elastic_search(query)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and 

In [167]:
def elas_rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
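`elas_rag` differs from `rag` only in the retrieval step. One way to avoid the duplication, shown here as a sketch with a hypothetical `make_rag` helper and made-up stand-in functions, is to pass the three stages in as arguments.

```python
def make_rag(search_fn, build_prompt_fn, llm_fn):
    """Return a rag(query) function wired to the given retrieval, prompt and LLM steps."""
    def rag(query):
        search_results = search_fn(query)
        prompt = build_prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag


# Stand-in stages so the sketch runs without a search index or an API key.
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def fake_build_prompt(query, search_results):
    return f"QUESTION: {query}\nCONTEXT: {search_results[0]['text']}"


def fake_llm(prompt):
    return prompt.upper()


demo_rag = make_rag(fake_search, fake_build_prompt, fake_llm)
result = demo_rag("Can I join?")
print(result)

# With the notebook's functions this would read, for example:
# rag = make_rag(search, build_prompt, llm)
# elas_rag = make_rag(elastic_search, build_prompt, llm)
```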

In [168]:
elas_rag(query)

"Yes, you can still join the course even after it has started. You are eligible to submit homework even if you haven't registered yet. However, keep in mind that there will be deadlines for turning in the final projects, so it's advisable not to leave everything for the last minute."

---
## Homework 1.0

In [169]:
!pip install requests



In [338]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [339]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [340]:
index_name = "homework"

if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_settings)
else:
    print(f"Index '{index_name}' already exists.")

Index 'homework' already exists.


In [341]:
for doc in tqdm(documents):
    es_client.index(index = index_name, document = doc)

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 948/948 [00:01<00:00, 475.65it/s]


### Q3. Searching

In [342]:
query = "How do execute a command on a Kubernetes pod?"

In [343]:
def elastic_search(query):

    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
        
            }
        }
    }

    response = es_client.search(index = index_name, body = search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
        #result_docs.append(hit['_score'])
    
    return result_docs

In [344]:
elastic_search(query)

[{'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)',
  'section': '5. Deploying Machine Learning Models',
  'question': 'How do I debug a docker container?',
  'course': 'machine-learning-zoomcamp'},
 {'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)',
  'section': '5. Deploying Machine Learning Models',
  'question': 'How do I debug a docker container?',
  'course': 'machine-learning-zoomcamp'},
 {'text': 'Launch the container im

### Q4. Filtering

In [327]:
query_fil = "How do copy a file to a Docker container?"

In [328]:
def elastic_search_1(query):

    search_query = {
        "size": 3,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index = index_name, body = search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
        #result_docs.append(hit['_score'])
    
    return result_docs

In [329]:
elas_results = elastic_search_1(query_fil)

In [330]:
elas_results

[{'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)',
  'section': '5. Deploying Machine Learning Models',
  'question': 'How do I debug a docker container?',
  'course': 'machine-learning-zoomcamp'},
 {'text': 'Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.\ndocker run -it --entrypoint bash <image>\nIf the container is already running, execute a command in the specific container:\ndocker ps (find the container-id)\ndocker exec -it <container-id> bash\n(Marcos MJD)',
  'section': '5. Deploying Machine Learning Models',
  'question': 'How do I debug a docker container?',
  'course': 'machine-learning-zoomcamp'},
 {'text': 'Launch the container im

### Q5. Building a prompt

In [372]:
context_template = """
Q: {question}
A: {text}
""".strip()

In [382]:
prompt_template = """
                        You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
                        Use only the facts from the CONTEXT when answering the QUESTION.
                        
                        QUESTION: {question}
                        
                        CONTEXT: 
                        {context}
                        """.strip()

context = ""
    
for doc in elas_results:
    context = context + f"question: {doc['question']}\nanswer: {doc['text']}\n\n"
    
prompt = prompt_template.format(question=query_fil, context=context).strip()
print(prompt)
print('Length of the resulting prompt:',len(prompt))

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
                        Use only the facts from the CONTEXT when answering the QUESTION.

                        QUESTION: How do copy a file to a Docker container?

                        CONTEXT: 
                        question: How do I debug a docker container?
answer: Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.
docker run -it --entrypoint bash <image>
If the container is already running, execute a command in the specific container:
docker ps (find the container-id)
docker exec -it <container-id> bash
(Marcos MJD)

question: How do I debug a docker container?
answer: Launch the container image in interactive mode and overriding the entrypoint, so that it starts a bash command.
docker run -it --entrypoint bash <image>
If the container is already running, execute a command in the specific container:
docker ps (find
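The `context_template` defined at the start of this question is not used by the cell above, which formats each entry with an inline f-string instead. A minimal sketch of applying the template to each document follows. Joining the entries with a blank line is an assumption of this example, `build_context` is a hypothetical helper, and the documents are made up. Note that switching to this format would change the prompt length reported above.

```python
context_template = """
Q: {question}
A: {text}
""".strip()


def build_context(docs):
    """Format every document with context_template and separate entries with a blank line."""
    entries = [
        context_template.format(question=doc["question"], text=doc["text"])
        for doc in docs
    ]
    return "\n\n".join(entries)


# Made-up documents, for illustration only.
example_docs = [
    {"question": "a?", "text": "b"},
    {"question": "c?", "text": "d"},
]

example_context = build_context(example_docs)
print(example_context)
```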

### Q6. Tokens

In [376]:
import tiktoken

In [377]:
encoding = tiktoken.encoding_for_model("gpt-4o")

In [378]:
tokens = encoding.encode(prompt)
print("Number of tokens:", len(tokens))

Number of tokens: 301


In [379]:
encoding.decode_single_token_bytes(63842)

b"You're"