# Experiment Results Document

---

## Experiment 1: Chunking Strategies

### Methodology

* Used **sentence‑based chunking** in the RAG pipeline.
* Evaluated using **10 test questions**.
* Each answer was scored on:

  * Relevance (1–5)
  * Correctness (1–5)
  * Completeness (1–5)
* Final scores were calculated using **average values from Jupyter output**.

### Results

| Strategy                | Avg Relevance | Avg Correctness | Avg Completeness | Pros                                        | Cons                                        |
| ----------------------- | ------------- | --------------- | ---------------- | ------------------------------------------- | ------------------------------------------- |
| Sentence‑based Chunking | **3.3**       | **3.4**         | **2.5**          | Preserves context, improves answer matching | Some answers lacked full supporting details |

### Key Findings

* Chunking achieved good relevance and correctness scores.
* Completeness was lower because some retrieved chunks did not contain full context.
* Sentence splitting helped maintain meaning and improved retrieval quality.

### Decision‑Making Rationale

Sentence‑based chunking was selected because it provided the best balance between context preservation and retrieval accuracy.

### Conclusion

Sentence‑based chunking performed effectively because it preserved contextual information, which improved the relevance and correctness of retrieved answers.

---

## Experiment 2: Prompt Engineering

### Methodology

* Used **structured prompting** for answer generation.
* Evaluated using the same **10 test questions**.
* Scored using relevance, correctness, and completeness.
* Average scores calculated from evaluation data.

### Results

| Strategy             | Avg Relevance | Avg Correctness | Avg Completeness | Pros                                                       | Cons                    |
| -------------------- | ------------- | --------------- | ---------------- | ---------------------------------------------------------- | ----------------------- |
| Structured Prompting | **3.3**       | **3.4**         | **2.5**          | Clear instructions, reduced ambiguity, better explanations | Slightly longer prompts |

### Key Findings

* Prompt engineering achieved the highest overall performance.
* Answers were clearer and more detailed.
* Structured prompts reduced hallucination and confusion.

### Decision‑Making Rationale

Structured prompting was chosen because it consistently improved clarity, accuracy, and completeness of generated responses.

### Conclusion

Structured prompting performed best because clear instructions helped the model generate more relevant and complete answers.

---

## Experiment 3: Retrieval Strategy (Top‑K)

### Methodology

* Tested retrieval performance using Top‑K configuration.
* Used the same **10 evaluation questions**.
* Responses were scored on relevance, correctness, and completeness.
* Average scores computed from Jupyter evaluation data.

### Results

| Strategy        | Avg Relevance | Avg Correctness | Avg Completeness | Pros                                  | Cons                            |
| --------------- | ------------- | --------------- | ---------------- | ------------------------------------- | ------------------------------- |
| Top‑K Retrieval | **2.4**       | **2.7**         | **1.8**          | Retrieves focused information quickly | Often misses supporting context |

### Key Findings

* Retrieval alone produced the lowest scores.
* Lower completeness indicates missing supporting information.
* Shows importance of combining retrieval with chunking and prompting.

### Decision‑Making Rationale

A balanced Top‑K value is necessary to ensure relevant results while maintaining sufficient answer completeness.

### Conclusion

Retrieval alone was not sufficient; it works best when combined with chunking and prompt engineering.

---

## Comparison Summary

| Metric       | Chunking | Prompting | Retrieval |
| ------------ | -------- | --------- | --------- |
| Relevance    | 3.3      | **3.4**   | 2.5       |
| Correctness  | 3.3      | **3.4**   | 2.5       |
| Completeness | 2.4      | **2.7**   | 1.8       |

The comparison graph generated in Jupyter visually confirms that prompting achieved the highest overall performance.

---

## Final Recommendations

Based on evaluation averages:

* Use **sentence‑based chunking** to preserve context.
* Use **structured prompting** for clearer and more complete answers.
* Use **balanced Top‑K retrieval** to support relevant context.

---

## Overall Conclusion

The experimental results show that **prompt engineering had the greatest impact** on answer quality, followed by chunking strategy. Retrieval alone produced lower performance, but when combined with chunking and structured prompts, it significantly improved the effectiveness of the RAG system.
