# Install crucial libraries

# Imports


## **Part 0: Choose and Test Your Topic Without a Knowledge Base**

Before you load any external documents, you must **verify that your chosen topic needs a knowledge base** to improve answers. This ensures your RAG system solves a real gap in the model’s knowledge.

###  **Steps:**

1. **Choose a Topic (Tentative)**

   * Pick a topic from 2024 or 2025 that you think is recent or under-documented.
   * Example topics:

     * A political decision (e.g., "European Union climate laws in 2024")
     * A cultural trend (e.g., "Music trends in early 2025")

2. **Formulate a Question**

   * Write down one factual, clear question about the topic.
   * Aim for a question that requires up-to-date or specific knowledge.

3. **Query the Model Directly**

   * Use your LLM pipeline (without RAG) to ask this question.
   * Collect the model’s answer and evaluate its quality:

     * Is the answer incomplete?
     * Is it outdated?
     * Is it confident but wrong?
     * Does it say *"I don’t know"*?
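The baseline check in step 3 can be sketched as a small helper. This is only an illustration: `ask_llm` is a hypothetical callable standing in for whatever LLM pipeline you built in the lab, and the list of uncertainty phrases is an assumption, not an exhaustive rule.

```python
from typing import Callable

# Illustrative, non-exhaustive phrases that suggest the model admits a knowledge gap.
UNCERTAINTY_MARKERS = (
    "i don't know",
    "i do not know",
    "i'm not sure",
    "no information",
)


def baseline_check(question: str, ask_llm: Callable[[str], str]) -> dict:
    """Ask the model with no retrieved context and record the result.

    `ask_llm` is a hypothetical function: it takes a prompt string and
    returns the model's answer as a string.
    """
    answer = ask_llm(question)
    # Normalize case and curly apostrophes before searching for markers.
    normalized = answer.lower().replace("’", "'")
    return {
        "question": question,
        "answer": answer,
        "admits_gap": any(marker in normalized for marker in UNCERTAINTY_MARKERS),
    }
```

The `admits_gap` flag only catches explicit refusals. Answers that are confident but wrong or outdated still have to be checked by hand against a trusted source.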

---

Why This Matters:

This step ensures your RAG project is solving a **real information gap**, not just repeating what the model already knows.


# **Part 1: Load a Custom PDF Knowledge Base**

Find a blog post, Wikipedia page, or article about your topic, save it as a PDF file, and load it with `PyPDFLoader`. You may use other loaders as well, but `PyPDFLoader` is the one we used in the lab.

- Find informative content on your topic (Wikipedia page, blog post, article, etc.)
- Save the page as a PDF file (you can use your browser’s print-to-PDF feature)


# **Part 2: Repeat the Lab with Your Own Knowledge Base + RAG Tuning**

## **Goal:**

Practice building a **RAG pipeline** and explore how **chunk size** and **chunk overlap** affect the quality of LLM answers to different questions.

---

## **What You Need to Do:**

1. **Repeat the Lab Using Your PDF Knowledge Base**

   * Use the PDF file you selected and loaded in Part 1.

2. **Create 3 Different Questions**

   * Design **three meaningful, specific questions** based on your topic.
   * Each question must be clearly related to the content of your PDF.

3. **Run RAG for Each Question with 3 Different Settings:**
   For each question:

   * Run the RAG pipeline **three times** using different settings for:

     * `chunk_size` (e.g., 100, 300, 500)
     * `chunk_overlap` (e.g., 0, 20, 50, 100)
   * This means you will run a total of **9 tests** (3 questions × 3 settings each).


4. **Answer Each Question Using an LLM**

   * Use the loaded chunks and a retriever to find relevant parts.
   * Pass the retrieved context to the LLM and generate an answer.
   * You can use tools similar to those we used in the lab.

5. **Explain Your Results**
   For each of the 3 questions:

   * Write a short **description of the question** and **why you chose it**.
   * **Compare the answers** you got using different settings.
   * Reflect on:

     * How answer quality changed with different `chunk_size` and `chunk_overlap`
     * Which setting gave the most useful or accurate result
     * Why you think it performed better/worse
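To build intuition for what `chunk_size` and `chunk_overlap` do, here is a simplified character-level sliding-window splitter. It is a stand-in for illustration only, not the splitter used in the lab: library splitters typically also try to break on paragraph or sentence boundaries. The function names and the three settings are assumptions chosen from the example values above.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        # This sketch requires a strictly smaller overlap so the window always advances.
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Illustrative (chunk_size, chunk_overlap) pairs for the three runs per question.
SETTINGS = [(100, 0), (300, 50), (500, 100)]


def chunk_counts(text: str, settings=SETTINGS) -> dict:
    """Return how many chunks each setting produces for the same text."""
    return {setting: len(split_text(text, *setting)) for setting in settings}
```

Smaller chunks give the retriever more, narrower pieces to choose from, while larger chunks keep more surrounding context in each piece. Overlap repeats text across neighboring chunks so that a sentence cut at a boundary still appears whole in at least one of them. Note that a pair such as `chunk_size=100` with `chunk_overlap=100` is rejected by this sketch.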

---

## **Deliverables:**

* Python code used for the RAG pipeline (with different chunking settings)
* PDF file from Part 1
* A JSON file named `rag_report_last_name_name_id.json` containing your results:

  * 3 questions with explanations
  * Generated answers for each setting
  * Comparison and reflection on the results

---


### Template for your JSON report file

In [1]:
your_results_dict = {
  "topic": "Write the title or theme of your chosen topic here (e.g., 'AI Developments in 2024')",
  "question": "Write the question you used to test your topic with the plain model, without RAG",
  "answer": "Paste the initial answer of the model without using RAG",
  "rag": [
    {
      "question": "Write your first custom question here",
      "reason": "Briefly explain why this question is important or relevant to your topic",
      "experiments": [
        {
          "chunk_size": "Enter an integer value for chunk size (e.g., 300, 500, etc.)",
          "chunk_overlap": "Enter an integer value for chunk overlap (e.g., 0, 100, etc.)",
          "answer": "Paste here the answer generated by the LLM for this setting",
          "reflection": "Write your analysis of the answer: Was it accurate, detailed, too short, off-topic, etc.?"
        },
        {
          "chunk_size": "Another chunk size value for testing (should differ from above)",
          "chunk_overlap": "Corresponding chunk overlap value for this test",
          "answer": "LLM-generated answer for the second test",
          "reflection": "Compare this result with the previous one—was it better or worse? Why?"
        },
        {
          "chunk_size": "Third chunk size value for testing",
          "chunk_overlap": "Third chunk overlap value for testing",
          "answer": "LLM-generated answer for the third test",
          "reflection": "Describe how this answer compares with the others and what you learned"
        }
      ]
    },
    {
      "question": "Write your second custom question here",
      "reason": "Explain why this question is meaningful to your topic",
      "experiments": [
        {
          "chunk_size": "Integer value for chunk size",
          "chunk_overlap": "Integer value for chunk overlap",
          "answer": "LLM response using this setting",
          "reflection": "Your evaluation of the answer quality with these parameters"
        },
        {
          "chunk_size": "Another chunk size",
          "chunk_overlap": "Another chunk overlap",
          "answer": "LLM response for second configuration",
          "reflection": "How did the result change? Why might that be?"
        },
        {
          "chunk_size": "Third chunk size",
          "chunk_overlap": "Third chunk overlap",
          "answer": "Final LLM result for this question",
          "reflection": "Summarize your findings from all three tests for this question"
        }
      ]
    },
    {
      "question": "Write your third custom question here",
      "reason": "Explain why this question is useful or interesting",
      "experiments": [
        {
          "chunk_size": "Integer chunk size value",
          "chunk_overlap": "Integer chunk overlap value",
          "answer": "Answer generated with this config",
          "reflection": "How well did it perform? Was it relevant?"
        },
        {
          "chunk_size": "Different chunk size value",
          "chunk_overlap": "Different chunk overlap value",
          "answer": "Second test answer",
          "reflection": "Comparison with first result"
        },
        {
          "chunk_size": "Third chunk size",
          "chunk_overlap": "Third chunk overlap",
          "answer": "Third LLM-generated answer",
          "reflection": "Overall evaluation of the test results"
        }
      ]
    }
  ]
}

In [2]:
import json

# Replace the file name below with your own, following the naming pattern from the deliverables.
with open("rag_report_Arkadiusz_Modzelewski_29580.json", "w", encoding="utf-8") as f:
    json.dump(your_results_dict, f, indent=2, ensure_ascii=False)
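Before saving, it can help to check that the dictionary matches the template. The sketch below is one possible check, not an official grader: the function name `validate_report` is hypothetical, and it assumes that `chunk_size` and `chunk_overlap` should be stored as integers, as the placeholder text in the template suggests.

```python
def validate_report(report: dict) -> list[str]:
    """Return a list of problems found in the report; an empty list means none were found."""
    problems = []
    for key in ("topic", "question", "answer", "rag"):
        if key not in report:
            problems.append(f"missing top-level key: {key}")

    rag = report.get("rag", [])
    if len(rag) != 3:
        problems.append(f"expected 3 questions in 'rag', found {len(rag)}")

    for q_index, entry in enumerate(rag, start=1):
        experiments = entry.get("experiments", [])
        if len(experiments) != 3:
            problems.append(
                f"question {q_index}: expected 3 experiments, found {len(experiments)}"
            )
        for e_index, experiment in enumerate(experiments, start=1):
            for field in ("chunk_size", "chunk_overlap"):
                # Assumption: the settings are stored as plain integers, not strings.
                if type(experiment.get(field)) is not int:
                    problems.append(
                        f"question {q_index}, experiment {e_index}: {field} is not an integer"
                    )
    return problems


# Example usage with the dictionary defined above:
# for problem in validate_report(your_results_dict):
#     print(problem)
```

Run on the unfilled template, this reports every `chunk_size` and `chunk_overlap` entry, because the placeholders are strings rather than integers.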