# LLM RAG Workshop

https://github.com/alexeygrigorev/llm-rag-workshop

In [9]:
import json

## Retrieval

### Loading the documents

In [10]:
with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
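
The loop above implies that `documents.json` is a list of course objects, each holding a `course` name and a `documents` list. The sketch below runs the same flattening on a small in-memory sample. The sample values and the `flatten_documents` name are hypothetical, since the real file contents are not shown here.

```python
# Hypothetical sample mirroring the structure the loading loop expects.
sample_file = [
    {
        "course": "example-course",
        "documents": [
            {"text": "Yes, you can.", "section": "General", "question": "Can I join late?"},
            {"text": "Use pip.", "section": "Setup", "question": "How do I install it?"},
        ],
    },
]


def flatten_documents(courses):
    """Return one flat list of documents, each tagged with its course name."""
    flat = []
    for course in courses:
        for doc in course["documents"]:
            # Copy the dict so the input structure is left unmodified.
            flat.append({**doc, "course": course["course"]})
    return flat


flat_docs = flatten_documents(sample_file)
print(len(flat_docs), flat_docs[0]["course"])
```

Unlike the cell above, this version copies each document instead of mutating it in place, which is a design choice of the sketch rather than something the notebook requires.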

### Indexing the documents

***Remember to start Elasticsearch via Docker first.***

In [11]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': 'b95a74bc9525', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'HY3P6PuHQHub21BPDT1IWQ', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [12]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)

response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [13]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:22<00:00, 41.23it/s]
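
Indexing the 948 documents one request at a time took about 22 seconds in the run above. The Elasticsearch Python client also provides `helpers.bulk`, which sends many documents per request. The following is a minimal sketch of that approach, assuming the same `es`, `documents`, and `index_name` as above. The function names `to_bulk_actions` and `bulk_index` are hypothetical.

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk action per document in the shape helpers.bulk accepts."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es, documents, index_name):
    """Index all documents through the bulk helper; needs a running cluster."""
    # Imported here so the action builder above works without the client installed.
    from elasticsearch import helpers

    return helpers.bulk(es, to_bulk_actions(documents, index_name))


# Offline check of the action shape, using a hypothetical document.
actions = list(to_bulk_actions([{"question": "q", "text": "t"}], "course-questions"))
print(actions)
```

With a live cluster, `bulk_index(es, documents, index_name)` could stand in for the `for` loop; it is not called here because it needs Elasticsearch to be running.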


### Retrieving the docs

* Retrieves the top 5 matching documents.
* Searches the `question`, `text`, and `section` fields with a `multi_match` query of type `best_fields`, boosting `question` by a factor of 3.
* Matches the user query "How do I join the course after it has started?".
* Returns results only for the `data-engineering-zoomcamp` course.

In [15]:
user_question = "How do I join the course after it has started?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

In [16]:
response = es.search(index=index_name, body=search_query)

for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...
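
To see why matches on the question text rank first, it can help to model how `best_fields` combines per-field scores: it keeps the highest boosted field score and adds `tie_breaker` times the remaining ones (`tie_breaker` defaults to 0.0). The sketch below is a simplified toy model, not Elasticsearch's implementation. The per-field scores are made up, whereas the real engine computes them with BM25.

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Toy model of best_fields: best boosted field score plus a tie-breaker share."""
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # The remaining fields only contribute when tie_breaker is above zero.
    return best + tie_breaker * (sum(boosted) - best)


boosts = {"question": 3.0}  # corresponds to "question^3" in the query

# Made-up raw scores for two hypothetical documents.
doc_a = {"question": 2.0, "text": 1.0, "section": 0.0}
doc_b = {"question": 0.5, "text": 4.0, "section": 0.0}

print(best_fields_score(doc_a, boosts))  # 6.0
print(best_fields_score(doc_b, boosts))  # 4.0
```

In this toy example the boost lets a moderate match in `question` outrank a stronger match in `text`.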



### Making a function

In [17]:
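def retrieve_documents(query, index_name="course-questions", max_results=5):
    """Return the top `max_results` FAQ documents matching `query`."""
    # A new client is created on every call, which keeps the function self-contained.
    es = Elasticsearch("http://localhost:9200")
    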
    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents

In [18]:
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...
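
`retrieve_documents` hard-codes the course filter. One possible variation, shown as a sketch, is to build the query body in a separate function that takes the course as a parameter. The name `build_search_query` and the `example-course` value are hypothetical.

```python
def build_search_query(query, course="data-engineering-zoomcamp", max_results=5):
    """Build the multi_match + term-filter query body as a plain dict."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # The term filter restricts hits without affecting their scores.
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("how can I run kafka?", course="example-course", max_results=3)
print(body["size"], body["query"]["bool"]["filter"])
```

The resulting dict can be passed to `es.search(index=index_name, body=body)` in the same way as in `retrieve_documents`. Keeping it a pure function also makes the query easy to inspect without a running cluster.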



## Generation (a.k.a. Answering Questions)

In [19]:
import dotenv
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [20]:
%reload_ext dotenv
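
The `OpenAI()` client used in the next cell reads its key from the `OPENAI_API_KEY` environment variable by default, which is what the `.env` file is expected to provide. A small sketch for failing early with a clear message when a variable is missing follows. `require_env` and `EXAMPLE_API_KEY` are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Demonstration with a dummy variable so no real key is needed.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

Calling `require_env("OPENAI_API_KEY")` before creating the client would surface a missing key right away.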

In [21]:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)
print(response.choices[0].message.content)

It's possible that you can still join the course, depending on a few factors. Here are some steps you can take:

1. **Check with the Instructor or Course Coordinator:** Reach out directly to the instructor or course coordinator to inquire about late enrollment. They may be able to accommodate you or provide necessary information on catching up.

2. **Review Enrollment Policies:** Some institutions have specific policies regarding late enrollment. Check the academic policies on your institution's website or contact the registrar's office.

3. **Assess Catch-Up Work:** Consider how much of the course you've missed and whether you can realistically catch up. Ask the instructor about any missed assignments, lectures, or exams and if there are any available resources to help you get up to speed.

4. **Seek Approval:** Some courses might require special approval for late enrollment, so be prepared to explain your situation and demonstrate your commitment to catching up.

5. **Administrative 

### Building a Prompt

In [22]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

context_docs = retrieve_documents(user_question)

context_result = ""

for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)

context = context_result.strip()
print(context)

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terrafo

In [23]:
prompt = f"""
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

In [24]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
answer

"You can still join the course after it has started. Be aware that you will still need to adhere to the deadlines for turning in the final projects. It's important not to leave everything for the last minute. \n\nAdditionally, even if you don't register, you are still eligible to submit the homeworks.\n\nIf you need to access the course materials that have already been covered, all the materials will remain available after the course has finished, so you can catch up at your own pace. \n\nNONE"

### Cleaning up the code

In [25]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()


def build_context(documents):
    context_result = ""
    
    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)
    
    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt

def ask_openai(prompt, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    return answer

def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = ask_openai(prompt)
    return answer
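
Because `qa_bot` calls Elasticsearch and OpenAI directly, it cannot be exercised offline. The sketch below wires the same retrieve, build-prompt, ask steps together but takes the retrieval and LLM functions as arguments, so stubs can stand in for the real services. `make_qa_bot`, `fake_retrieve`, and `fake_llm` are hypothetical names, and the stub document is made up.

```python
CONTEXT_TEMPLATE = "Section: {section}\nQuestion: {question}\nAnswer: {text}"

PROMPT_TEMPLATE = (
    "You're a course teaching assistant.\n"
    "Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.\n"
    "Don't use other information outside of the provided CONTEXT.\n\n"
    "QUESTION: {user_question}\n\n"
    "CONTEXT:\n\n{context}"
)


def build_context(documents):
    """Format each document with the context template, separated by blank lines."""
    return "\n\n".join(CONTEXT_TEMPLATE.format(**doc) for doc in documents)


def make_qa_bot(retrieve, ask_llm):
    """Return a qa_bot function built from the given retrieval and LLM callables."""
    def qa_bot(user_question):
        docs = retrieve(user_question)
        prompt = PROMPT_TEMPLATE.format(
            user_question=user_question,
            context=build_context(docs),
        )
        return ask_llm(prompt)
    return qa_bot


def fake_retrieve(query):
    # Stub standing in for retrieve_documents; returns one made-up document.
    return [{"section": "General", "question": "Can I join late?",
             "text": "Yes.", "course": "example-course"}]


def fake_llm(prompt):
    # Stub standing in for ask_openai; echoes the prompt so it can be inspected.
    return prompt


bot = make_qa_bot(fake_retrieve, fake_llm)
print(bot("Can I still join?"))
```

Passing `retrieve_documents` and `ask_openai` to `make_qa_bot` should give the same behavior as the `qa_bot` defined above, assuming both services are reachable.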

In [26]:
qa_bot("I'm getting invalid reference format: repository name must be lowercase")

'If you are getting the error "invalid reference format: repository name must be lowercase" when working with Docker on Windows, here are some steps you can try to resolve the issue:\n\n1. **Move Data to a Folder Without Spaces**:\n    - Ensure your code or data is located in a folder path without spaces. For example, move your code from `“C:/Users/Alexey Grigorev/git/…”` to `“C:/git/…”`.\n\n2. **Modify the Volume Mapping Command**:\n    - Replace the `-v` part in your Docker command with one of the following options to specify the volume correctly:\n\n    ```\n    -v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n    -v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n    -v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n    -v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data\n    --volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/postgresql/data\n    ```\n\n3. **Add `winpty` if Using `docker run -it`**:\n    - If you are

In [27]:
qa_bot("I can't connect to postgres port 5432, my password doesn't work")

'It appears that you are facing issues connecting to your Postgres instance on port 5432 due to password authentication failure. Here are a few steps to resolve this issue based on the provided CONTEXT:\n\n1. **Check Port Conflict**:\n   - You might have another instance of Postgres running on port 5432. A common solution is to change the port mapping to a different one (e.g., 5431):\n     ```bash\n     docker run -e POSTGRES_PASSWORD=root -p 5431:5432 postgres\n     ```\n   - When connecting, ensure you are using the correct port. For example:\n     ```python\n     create_engine(\'postgresql://root:root@localhost:5431/ny_taxi\')\n     ```\n\n2. **Verify Username and Password**:\n   - Ensure that the username and password you are using are correct. The error message suggests it is a password authentication error. Double-check the credentials in your connection string.\n\n3. **Check and Stop Local Postgres Service**:\n   - If Postgres is also installed locally on your machine, it might 

In [28]:
qa_bot("how can I run kafka?")

'To run a Kafka producer, consumer, or kstreams in the terminal using Java, follow these steps:\n\n1. Navigate to your project directory.\n2. Run the following command:\n   ```\n   java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n   ```\n\nReplace `<jar_name>` with the actual name of your JAR file. This command assumes you have already built your project and have the necessary JAR file in the `build/libs` directory.'