# 6. Results
To evaluate our RAG approach, we ingested a curated set of FABRIC notebooks into our vector database. We then tested several of the LLMs mentioned earlier by giving each of them three distinct queries, i.e., questions for which to generate FABRIC answers (Python scripts). Finally, we manually ranked the correctness of each LLM output using a simple scoring system ranging from “Useless” to “Code is correct”.


## Queries
We chose three queries for testing, representing three levels of complexity. The queries mimic questions commonly asked by FABRIC users at various levels of expertise.

- Easy: How can I check what slices I have?
- Intermediate: How can I look up when my slices expire and extend them by 20 days?
- Advanced: How can I create a slice with two nodes connected with L3 networks using Basic NICs and do cpu pinning and NUMA tuning and launch iperf3 tests between them?

## Scoring system
We ran each query once for each model, using a temperature of 0.0 as noted above. FABRIC software team members then scored each output; in many cases, we ran the generated code to confirm that it successfully reserved the correct set of resources. Scores range from 0 to 4:

- 0: Useless. Largely a result of hallucination.
- 1: Contains some correct elements but is largely incorrect.
- 2: About half correct (some useful sections or elements, but still far from correct).
- 3: Mostly correct. The code would be a good starting point and is usable with minor corrections.
- 4: Code is correct and runs without any edits.
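
To make the aggregation of these scores concrete, the following is a minimal sketch of how per-query averages could be computed from individual ratings. The function name, the record layout, and the model and condition labels are assumptions made for illustration, and the example records are placeholders rather than the scores measured in this study.

```python
from statistics import mean

# Rubric labels, paraphrased from the 0-4 scale above.
RUBRIC = {
    0: "Useless",
    1: "Some correct elements, largely incorrect",
    2: "About half correct",
    3: "Mostly correct",
    4: "Correct, runs without edits",
}


def average_scores(records):
    """Average rubric scores per (query, condition) pair.

    records: iterable of (model, query, condition, score) tuples.
    """
    grouped = {}
    for model, query, condition, score in records:
        if score not in RUBRIC:
            raise ValueError(f"score must be one of {sorted(RUBRIC)}, got {score!r}")
        grouped.setdefault((query, condition), []).append(score)
    return {key: mean(values) for key, values in grouped.items()}


# Placeholder records for illustration only; NOT the scores measured in this study.
example = [
    ("model-a", "easy", "rag", 4),
    ("model-b", "easy", "rag", 3),
    ("model-a", "easy", "no-rag", 1),
    ("model-b", "easy", "no-rag", 2),
]
print(average_scores(example))
```

Grouping by query and condition mirrors how the results are discussed in this section: one average per question, with and without RAG.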


## Test Results

The results clearly demonstrate the effectiveness of RAG. Without RAG, no model was able to generate error-free code even for the easy query, which can be answered in as few as 3 lines of code. For Question #2 (intermediate), the no-RAG code bore little resemblance to the correct result, and for Question #3 (advanced), all outputs could be described as hallucinations. Without RAG, therefore, even highly rated LLMs are incapable of generating coherent and useful code for FABRIC users.


With RAG (i.e., when similar FABRIC code examples retrieved from the vector store are passed to the LLM along with the user's query), performance improves significantly. For Questions #1 and #2, about half of the LLMs generated error-free code. Even for Question #3, which requires a lengthy and complicated script, the average score with RAG was close to 3, implying that the code is largely correct and can serve as a good starting point even if it contains some minor errors. This is a completely different level of correctness from the no-RAG output for the same question, which was completely useless.

**Figure 5: Comparison of RAG vs. no RAG outputs for the same question using the same LLM**

The output on the right (with RAG) is correct. Without RAG, note the wrong import statement and the `get_all_slices()` method, which the model made up.


![notebook_example](images/Easy_Q_output.png)

**Figure 6: Comparison of RAG vs. no RAG outputs for the same question using the same LLM (example 2)**

The output on the left is completely useless as a template, whereas the one on the right, although not perfect, can serve as a great starting point.

![notebook_example](images/Hard_Q_output.png)

Even with RAG, some details remained imperfect. For example, `list_slices()` and `get_slices()` are similar, but their return types differ. As a result, multiple outputs included a loop that failed to perform the intended task.
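
The failure mode can be reproduced with a small stand-in. The classes below are hypothetical stubs, not the FABRIC library itself. They assume that `get_slices()` returns a list of slice objects while `list_slices()` returns a rendered table meant for display, which is one way the return types could differ.

```python
from dataclasses import dataclass


@dataclass
class StubSlice:
    # Hypothetical slice record; the field names are illustrative only.
    name: str
    lease_end: str


class StubManager:
    """Hypothetical stand-in for a slice manager; not the FABRIC library."""

    def __init__(self, slices):
        self._slices = list(slices)

    def get_slices(self):
        # Returns objects that a loop can act on.
        return list(self._slices)

    def list_slices(self):
        # Assumed here to return a rendered text table meant for display.
        rows = [f"{s.name}\t{s.lease_end}" for s in self._slices]
        return "\n".join(["name\tlease_end", *rows])


# Placeholder slice names and dates.
manager = StubManager(
    [StubSlice("slice-a", "2024-01-01"), StubSlice("slice-b", "2024-02-01")]
)

# Works: each item is a slice object.
names = [s.name for s in manager.get_slices()]

# Fails: iterating the display table yields single characters, not slices.
try:
    [s.name for s in manager.list_slices()]
except AttributeError as exc:
    error = str(exc)
```

A loop written against the display-oriented call looks plausible at a glance, which may be why this error survived even when relevant examples were retrieved.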