In [4]:
import pandas as pd
import json
import os

Evaluation setup:  
* Embedding model - Text Embedding 004 (Google)  
* LLM - Gemini 2.5 Flash  


## RAGAS Answer Correctness:

The assessment of Answer Correctness involves measuring the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
Answer correctness is computed as a weighted combination of factual correctness and the semantic similarity between the given answer and the ground truth.  

Factual correctness compares the generated response with the reference and evaluates how factually accurate the response is. The score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM to first break down the response and the reference into claims and then uses natural language inference to determine the factual overlap between them. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the mode parameter. By default, the mode is set to F1; you can change it to precision or recall by setting the mode parameter.

Answer similarity is calculated in the following steps:  
Step 1: Vectorize the ground truth answer using the embedding model.  
Step 2: Vectorize the generated answer using the same embedding model.  
Step 3: Compute the cosine similarity between the two vectors.  
        
By default, the "text-embedding-ada-002" model is used. In this evaluation, we used Text Embedding 004 (Google).  

Final score is created by taking a weighted average of the factual correctness (F1 score) and the semantic similarity. 
(By default, there is a 0.75 : 0.25 weighting.)   
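
As a rough illustration of the weighting described above, the sketch below combines an F1 score computed from claim-level counts with a similarity value. The function names and the example counts are hypothetical and are not part of the RAGAS API; only the 0.75 : 0.25 default weighting comes from the text.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score from true positives, false positives and false negatives."""
    denominator = tp + 0.5 * (fp + fn)
    # Assumption: return 0.0 when there are no claims at all.
    return tp / denominator if denominator > 0 else 0.0


def answer_correctness(f1: float, similarity: float, weights=(0.75, 0.25)) -> float:
    """Weighted average of factual correctness (F1) and semantic similarity."""
    w_factual, w_similarity = weights
    return (w_factual * f1 + w_similarity * similarity) / (w_factual + w_similarity)


# Toy example: 3 supported claims, 1 unsupported claim, 1 missed claim.
f1 = f1_from_counts(tp=3, fp=1, fn=1)
score = answer_correctness(f1, similarity=0.88)
print(f"F1 = {f1:.4f}, answer correctness = {score:.4f}")
```

With these weights, factual correctness dominates the score, so a high embedding similarity alone cannot compensate for missing or unsupported claims.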

Total API Calls: 4  
* 1 LLM call to produce the "simple statements"  
* 1 LLM call to determine the true positives, false positives, and false negatives  
* 1 embedding call to embed the ground truth  
* 1 embedding call to embed the AI answer  

### Unsuccessful experiment with Gemini 2.5 Flash:  
Because the generated results are non-deterministic, I run the evaluation 3 times and calculate the mean. Evaluating 370 questions x 3 runs x 2 systems (vanilla and RAG) x 4 API calls = 8880 API calls in total for one set of results. When I tried Gemini 2.5 Flash, I also calculated other metrics; in total it cost 16000 API calls and 167 dollars.
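
The call count above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the per-call cost is a simple average derived from the totals quoted in the text, not an official price.

```python
n_questions = 370
n_runs = 3
n_systems = 2          # vanilla and RAG
calls_per_sample = 4   # per the Answer Correctness breakdown

total_calls = n_questions * n_runs * n_systems * calls_per_sample
print(f"Answer Correctness calls for one result set: {total_calls}")

# Average cost per call over the whole experiment (all metrics together).
observed_calls = 16_000
observed_cost_usd = 167
cost_per_call = observed_cost_usd / observed_calls
print(f"Average cost per call: ${cost_per_call:.4f}")
```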

**Conclusion**: I will stick with the Gemini 2.0 Flash model as the main judge for the rest of the project.

I suspect that RAGAS makes many more LLM calls than the 4 stated in the documentation. I didn't examine it properly, but the [source code](https://github.com/explodinggradients/ragas/blob/main/ragas/src/ragas/metrics/_answer_correctness.py) is available.
There is a [complaint](https://www.reddit.com/r/LangChain/comments/1dbmqii/i_spent_700_on_evaluating_100_rag_qa_set_using/) about RAGAS costing $700 for a 100 QA set.      


Sources:  
* [RAGAS Docs for Answer Correctness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_correctness.html)  
* [RAGAS Docs for Semantic Similarity](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/semantic_similarity/)  
* [RAGAS Docs for Factual Correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness/#factual-correctness) 
* [RAGAS API reference for Embeddings](https://docs.ragas.io/en/stable/references/embeddings/#ragas.embeddings.embedding_factory)  
* [Jupyter Notebook with examples of RAGAS Evaluation pipeline](https://github.com/dkhundley/llm-rag-guide/blob/main/notebooks/ragas.ipynb)    




In [5]:
file_name = "test_dataset_together_meta-llama_Llama-4-Scout-17B-16E-Instruct_top5_answered.json"
base_name = file_name.replace('.json', '')


VANILLA_ANSWER_CORRECTNESS = f"{base_name}_vanilla_answer_correctness_evaluated.json"
RAG_ANSWER_CORRECTNESS = f"{base_name}_rag_answer_correctness_evaluated.json"
RAG_ANSWER_SIMILARITY = f"{base_name}_rag_answer_similarity_evaluated.json"
VANILLA_ANSWER_SIMILARITY = f"{base_name}_vanilla_answer_similarity_evaluated.json"
RAG_ANSWER_RELEVANCY = f"{base_name}_rag_answer_relevancy_evaluated.json"
RAG_FAITHFULNESS = f"{base_name}_rag_faithfulness_evaluated.json"

with open(VANILLA_ANSWER_CORRECTNESS, 'r', encoding='utf-8') as f:
    data = json.load(f)   
df_1 = pd.DataFrame(data)

with open(RAG_ANSWER_CORRECTNESS, 'r', encoding='utf-8') as f:
    data = json.load(f) 
df_2 = pd.DataFrame(data)
df_2 = df_2[["Modified Questions", 
                 "Answer Correctness for RAG run 1", 
                 "Answer Correctness for RAG run 2",
                 "Answer Correctness for RAG run 3",
                 "Mean Answer Correctness for RAG"]]
    
merged_df = pd.merge(df_1, df_2, on="Modified Questions", how="inner")
# Calculate overall means
vanilla_mean = merged_df['Mean Answer Correctness for vanilla'].mean()
rag_mean = merged_df['Mean Answer Correctness for RAG'].mean()
    
print(f"Overall Mean Answer Correctness for Vanilla: {vanilla_mean:.4f}")
print(f"Overall Mean Answer Correctness for RAG: {rag_mean:.4f}")
print(f"Difference (RAG - Vanilla): {rag_mean - vanilla_mean:.4f}")   



FileNotFoundError: [Errno 2] No such file or directory: 'test_dataset_together_meta-llama_Llama-4-Scout-17B-16E-Instruct_top5_answered_vanilla_answer_correctness_evaluated.json'
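
The error above suggests that the evaluated JSON files were not in the notebook's working directory when this cell was run. A small helper such as the one below could list which expected files are missing before any of them is opened. The helper is hypothetical and not part of the original pipeline; it only assumes the file-naming pattern used in the cell above.

```python
from pathlib import Path


def find_missing_files(base_name, suffixes, directory="."):
    """Return the expected evaluation files that do not exist in `directory`."""
    expected = [f"{base_name}_{suffix}_evaluated.json" for suffix in suffixes]
    return [name for name in expected if not (Path(directory) / name).exists()]


# Suffixes taken from the file names defined in the cell above.
suffixes = [
    "vanilla_answer_correctness",
    "rag_answer_correctness",
    "rag_answer_similarity",
    "vanilla_answer_similarity",
    "rag_answer_relevancy",
    "rag_faithfulness",
]
base_name = "test_dataset_together_meta-llama_Llama-4-Scout-17B-16E-Instruct_top5_answered"

for name in find_missing_files(base_name, suffixes):
    print(f"Missing: {name}")
```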

Again, the vanilla LLM response is rated higher than the RAG-enhanced one. 

## RAGAS Answer semantic similarity  

This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment 
between the generated answer and the ground truth.  
Step 1: Vectorize the ground truth answer using the specified embedding model.  
Step 2: Vectorize the generated answer using the same embedding model.  
Step 3: Compute the cosine similarity between the two vectors.  
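
The three steps can be sketched as follows. The vectors here are toy values standing in for real embeddings; in the actual evaluation they would come from the embedding model, and the function name is illustrative rather than a RAGAS API.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional "embeddings" (assumed values, not real model output).
ground_truth_vec = [0.2, 0.7, 0.1]
answer_vec = [0.25, 0.6, 0.2]

print(f"Similarity: {cosine_similarity(ground_truth_vec, answer_vec):.4f}")
```

Because the score depends only on the two embedding vectors, it does not change between runs for a fixed embedding model.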

This metric is a component of the RAGAS Answer Correctness metric, whose final score is a weighted average of the factual correctness (F1 score) and the semantic similarity. 
(By default, there is a 0.75 : 0.25 weighting.)   


Total API Calls: 2
* 1 embedding call to embed the ground truth
* 1 embedding call to embed the AI answer  

The metric is produced much faster (around 30 minutes for the whole set of results). It is the 25% component of the Answer Correctness metric.  
Since the output is deterministic, we run the evaluation only once for each question.   

Sources:  
* [Ragas Docs for semantic similarity](https://docs.ragas.io/en/v0.1.21/concepts/metrics/semantic_similarity.html)   
* https://github.com/dkhundley/llm-rag-guide/blob/main/notebooks/ragas.ipynb

In [None]:

with open(RAG_ANSWER_SIMILARITY, 'r', encoding='utf-8') as f:
    data = json.load(f) 
df_3 = pd.DataFrame(data)
df_3 = df_3[["Modified Questions", 
            "Answer Semantic Similarity for rag"]]


with open(VANILLA_ANSWER_SIMILARITY, 'r', encoding='utf-8') as f:
    data = json.load(f) 
df_4 = pd.DataFrame(data)
df_4 = df_4[["Modified Questions", 
            "Answer Semantic Similarity for vanilla"]]

merged_df = pd.merge(merged_df, df_3, on="Modified Questions", how="inner")
merged_df = pd.merge(merged_df, df_4, on="Modified Questions", how="inner")

# Calculate overall means
rag_similarity_mean = merged_df['Answer Semantic Similarity for rag'].mean()
vanilla_similarity_mean = merged_df['Answer Semantic Similarity for vanilla'].mean()
print(f"Overall Mean Answer Semantic Similarity for RAG: {rag_similarity_mean:.4f}")
print(f"Overall Mean Answer Semantic Similarity for Vanilla: {vanilla_similarity_mean:.4f}")
print(f"Difference (RAG - Vanilla): {rag_similarity_mean - vanilla_similarity_mean:.4f}")

# Group by psychiatric category
category_stats = merged_df.groupby('psychiatric_category').agg({
    'Answer Semantic Similarity for vanilla': 'mean',
    'Answer Semantic Similarity for rag': 'mean'}).round(4)

# Rename the aggregated columns
category_stats.columns = ['Vanilla_Mean', 'RAG_Mean']

# Add difference column
category_stats['Difference (RAG - Vanilla)'] = (category_stats['RAG_Mean'] -
                                                category_stats['Vanilla_Mean']).round(4)

# Sort by difference to see which categories benefit most from RAG
category_stats = category_stats.sort_values('Difference (RAG - Vanilla)', ascending=False)

display(category_stats)

Overall Mean Answer Semantic Similarity for RAG: 0.8752
Overall Mean Answer Semantic Similarity for Vanilla: 0.8794
Difference (RAG - Vanilla): -0.0042


| psychiatric_category | Vanilla_Mean | RAG_Mean | Difference (RAG - Vanilla) |
|---|---|---|---|
| Eating Disorders | 0.8511 | 0.8656 | 0.0145 |
| Somatic Disorders | 0.8635 | 0.8762 | 0.0127 |
| Personality Disorders | 0.8951 | 0.9062 | 0.0111 |
| Anxiety Disorders | 0.8841 | 0.885 | 0.0009 |
| Other Mental Disorders | 0.8745 | 0.8709 | -0.0036 |
| Dissociative Disorders | 0.9364 | 0.9323 | -0.0041 |
| Bipolar Disorders | 0.883 | 0.8783 | -0.0047 |
| Trauma and Stressor Related Disorders | 0.8932 | 0.8849 | -0.0083 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.8835 | 0.8752 | -0.0083 |
| Depressive Disorders | 0.8735 | 0.863 | -0.0105 |


## RAGAS Answer Relevance  

The Answer Relevance metric evaluates to what extent the generated answer addresses the provided question. The answer is considered relevant if it directly addresses the question.    
* Step 1: Reverse-engineer ‘n’ variants of the question from the generated answer using an LLM (prompt: "Generate a question for the given answer. answer: [answer]")  
* Step 2: Generate embeddings for all the questions. Calculate the mean cosine similarity between the generated questions and the actual question.  

The Answer Relevance metric doesn't assess the factual correctness of the generated answer but rather penalises redundant or insufficient answers.  

Total API Calls by default: 4  
* 1 LLM call to generate the questions based on the answer (by default, 3 questions)  
* 1 embedding call for each generated question (by default, 3)  
* 1 embedding call to embed the original question  
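
Step 2 above can be sketched as the mean cosine similarity between the original question's embedding and the embeddings of the generated questions. The function name and the toy vectors are assumptions made for illustration; real inputs would come from the embedding model.

```python
import numpy as np


def answer_relevance(question_vec, generated_question_vecs) -> float:
    """Mean cosine similarity between one question and n generated questions."""
    q = np.asarray(question_vec, dtype=float)
    g = np.asarray(generated_question_vecs, dtype=float)  # shape (n, dim)
    similarities = (g @ q) / (np.linalg.norm(g, axis=1) * np.linalg.norm(q))
    return float(similarities.mean())


# Toy embeddings: the original question and 3 questions generated from the answer.
original = [0.6, 0.3, 0.1]
generated = [
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.9],  # an off-topic generated question lowers the mean
]
print(f"Answer relevance: {answer_relevance(original, generated):.4f}")
```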

We set the LLM temperature to 0.0 and still run the evaluation 3 times. The metric is still produced relatively fast, and averaging over the runs reduces the influence of non-deterministic output. Sometimes all 3 evaluations give the same result, and sometimes they differ slightly.  

Sources:  
* [RAGAS Documentation](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_relevance.html)  
* [Original RAGAS paper](https://arxiv.org/abs/2309.15217)  

In [None]:
with open(RAG_ANSWER_RELEVANCY, 'r', encoding='utf-8') as f:
    data = json.load(f) 

df_4 = pd.DataFrame(data)
df_4 = df_4[["Modified Questions",
            "answer_relevancy for RAG run 1",
    "answer_relevancy for RAG run 2",
    "answer_relevancy for RAG run 3",
    "Mean answer_relevancy for RAG"]]
merged_df = pd.merge(merged_df, df_4, on="Modified Questions", how="inner")

relevancy_mean = merged_df['Mean answer_relevancy for RAG'].mean()
print(f"Overall Mean Relevancy for RAG: {relevancy_mean:.4f}")

# Group by psychiatric category
category_stats = merged_df.groupby('psychiatric_category').agg({
    'Mean answer_relevancy for RAG': 'mean'}).round(4)

# Rename the aggregated column
category_stats.columns = ['RAG_Mean']

# Sort categories by mean RAG relevancy, highest first
category_stats = category_stats.sort_values('RAG_Mean', ascending=False)

print(category_stats)

Overall Mean Relevancy for RAG: 0.7071
                                                    RAG_Mean
psychiatric_category                                        
Dissociative Disorders                                0.8022
Anxiety Disorders                                     0.7822
Eating Disorders                                      0.7656
Obsessive-Compulsive Disorders                        0.7543
Depressive Disorders                                  0.7442
Personality Disorders                                 0.7439
Trauma and Stressor Related Disorders                 0.7069
Schizophrenia Spectrum and Other Psychotic Diso...    0.7028
Somatic Disorders                                     0.6944
Bipolar Disorders                                     0.6814
Other Mental Disorders                                0.6642


# RAG Metrics  
## RAGAS Faithfulness 

Definition:  
The terms faithfulness and groundedness are sometimes used interchangeably.  

The process:  
Faithfulness measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; the higher, the better. A low faithfulness score indicates that the language model produces its response without correctly using the provided context, or doesn't find any relevant passages in it.   
The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context.  
* At the first step, the generated answer is broken down into individual statements.     
* At the next step, each of these claims is cross-checked with the given context to determine if it can be inferred from the context.    
* The final score is calculated by dividing the number of claims that can be inferred from the context by the total number of claims in the generated response.    
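
The final step reduces to a simple ratio. In the sketch below, each boolean stands for the judge's verdict on one claim (True if the claim can be inferred from the context); the function name and the handling of an empty claim list are assumptions, not RAGAS behaviour.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of the answer's claims that are supported by the context."""
    if not verdicts:
        # Assumption: an answer with no extractable claims scores 0.0.
        return 0.0
    return sum(verdicts) / len(verdicts)


# Toy example: 4 claims in the answer, 1 of them supported by the context.
print(faithfulness_score([True, False, False, False]))
```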

We set the LLM temperature to 0.0 and still run the evaluation 3 times. The metric is still produced relatively fast, and averaging over the runs reduces the influence of non-deterministic output. Sometimes all 3 evaluations give the same result, and sometimes they differ slightly.  
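
A minimal sketch of how the three runs might be averaged per question, and how their disagreement can be inspected, is shown below. The column names follow the result files used in this notebook; the values and the `run_spread` column are made up for illustration.

```python
import pandas as pd

run_cols = [f"faithfulness for RAG run {i}" for i in (1, 2, 3)]

# Toy scores for two questions (not real results).
df = pd.DataFrame({
    "faithfulness for RAG run 1": [0.50, 0.25],
    "faithfulness for RAG run 2": [0.50, 0.30],
    "faithfulness for RAG run 3": [0.50, 0.20],
})

# Per-question mean over the three runs.
df["Mean faithfulness for RAG"] = df[run_cols].mean(axis=1)

# Spread between the highest and lowest run; 0 means the runs agreed exactly.
df["run_spread"] = df[run_cols].max(axis=1) - df[run_cols].min(axis=1)

print(df[["Mean faithfulness for RAG", "run_spread"]])
```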

Source:  
* [RAGAS Docs for Faithfulness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/faithfulness.html)   
* [Microsoft Learn End-to-end LLM evaluation](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-llm-evaluation-phase)  


In [None]:
with open(RAG_FAITHFULNESS, 'r', encoding='utf-8') as f:
    data = json.load(f)
df_3 = pd.DataFrame(data)
df_3 = df_3[["Modified Questions", 
                "faithfulness for RAG run 1",
                "faithfulness for RAG run 2",
                "faithfulness for RAG run 3",
                "Mean faithfulness for RAG"]]
    
merged_df = pd.merge(merged_df, df_3, on="Modified Questions", how="inner")

faithfulness_mean = merged_df['Mean faithfulness for RAG'].mean()
print(f"Overall Mean Faithfulness for RAG: {faithfulness_mean:.4f}")

Overall Mean Faithfulness for RAG: 0.2678


On average, only about 27% of the claims in the generated answers can be inferred from the retrieved context. 

## Final function to generate report 
(work in progress)

In [None]:

def process_files(file_name, model_name='Llama-4-Scout'):
    '''
    Print a full evaluation report for one model's set of results.
    '''
    base_name = file_name.replace('.json', '')

    # Files with answer similarity results
    VANILLA_ANSWER_SIMILARITY = f"{base_name}_vanilla_answer_similarity_evaluated.json"
    RAG_ANSWER_SIMILARITY = f"{base_name}_rag_answer_similarity_evaluated.json"
    VANILLA_ANSWER_RELEVANCE = f"{base_name}_vanilla_answer_relevancy_evaluated.json"
    RAG_ANSWER_RELEVANCE = f"{base_name}_rag_answer_relevancy_evaluated.json"
    RAG_FAITHFULNESS = f"{base_name}_rag_faithfulness_evaluated.json"
    RAG_CONTEXT_PRECISION = f"{base_name}_rag_context_precision_evaluated.json"

    
    print("\n" + "=" * 80 + "\n")
    print(f"Evaluation of {base_name} Results:\n")

    # ANSWER SIMILARITY - BOTH VANILLA AND RAG
  
    with open(VANILLA_ANSWER_SIMILARITY, 'r', encoding='utf-8') as f:
        data = json.load(f) 
    vanilla_answer_similarity = pd.DataFrame(data)


    with open(RAG_ANSWER_SIMILARITY, 'r', encoding='utf-8') as f:
        data = json.load(f) 
    rag_answer_similarity = pd.DataFrame(data)
    rag_answer_similarity = rag_answer_similarity[["Modified Questions", 
            "Answer Semantic Similarity for rag"]]
    
    merged_df = pd.merge(vanilla_answer_similarity, rag_answer_similarity, on="Modified Questions", how="inner")
    

    # Calculate overall means
    print("Answer Semantic Similarity Results:")
    vanilla_similarity_mean = merged_df['Answer Semantic Similarity for vanilla'].mean()
    rag_similarity_mean = merged_df['Answer Semantic Similarity for rag'].mean()
    print(f"Overall Mean Answer Semantic Similarity for RAG: {rag_similarity_mean:.4f}")
    print(f"Overall Mean Answer Semantic Similarity for Vanilla: {vanilla_similarity_mean:.4f}")
    print(f"Difference (RAG - Vanilla): {rag_similarity_mean - vanilla_similarity_mean:.4f}")

    # Group by psychiatric category
    category_stats_answer_similarity = merged_df.groupby('psychiatric_category').agg({
            'Answer Semantic Similarity for vanilla': 'mean',
            'Answer Semantic Similarity for rag': 'mean'}).round(4)
        
    
    category_stats_answer_similarity.columns = ['Vanilla_Mean', 'RAG_Mean']  
    # Add difference column
    category_stats_answer_similarity['Difference (RAG - Vanilla)'] = (category_stats_answer_similarity['RAG_Mean'] - 
                                                    category_stats_answer_similarity['Vanilla_Mean']).round(4) 
    # Sort by difference
    category_stats_answer_similarity = category_stats_answer_similarity.sort_values('Difference (RAG - Vanilla)', ascending=False)
        
    display(category_stats_answer_similarity)

    # ANSWER RELEVANCY - BOTH VANILLA AND RAG

    with open(VANILLA_ANSWER_RELEVANCE, 'r', encoding='utf-8') as f:
        data = json.load(f)
    vanilla_answer_relevancy = pd.DataFrame(data)
    vanilla_answer_relevancy = vanilla_answer_relevancy[["Modified Questions",
                                                         "answer_relevancy for Vanilla run 1",
                                                    "answer_relevancy for Vanilla run 2",
                                                    "answer_relevancy for Vanilla run 3",
                                                    "Mean answer_relevancy for Vanilla"]]
    
    merged_df = pd.merge(merged_df, vanilla_answer_relevancy, on="Modified Questions", how="inner")

    with open(RAG_ANSWER_RELEVANCE, 'r', encoding='utf-8') as f:
        data = json.load(f)
    rag_answer_relevancy = pd.DataFrame(data)
    rag_answer_relevancy = rag_answer_relevancy[["Modified Questions",
                                                 "answer_relevancy for RAG run 1",
                                                    "answer_relevancy for RAG run 2",
                                                    "answer_relevancy for RAG run 3",
                                                    "Mean answer_relevancy for RAG"]]
    
    merged_df = pd.merge(merged_df, rag_answer_relevancy, on="Modified Questions", how="inner")

    print("\n" + "=" * 80 + "\n")
    print("Answer Relevancy Results:")
    # Calculate overall means
    vanilla_answer_relevancy_mean = merged_df['Mean answer_relevancy for Vanilla'].mean()
    rag_answer_relevancy_mean = merged_df['Mean answer_relevancy for RAG'].mean()
    print(f"Overall Mean Answer Relevancy for Vanilla: {vanilla_answer_relevancy_mean:.4f}")
    print(f"Overall Mean Relevancy for RAG: {rag_answer_relevancy_mean:.4f}")

    # Group by psychiatric category
    category_stats_relevancy = merged_df.groupby('psychiatric_category').agg({
        'Mean answer_relevancy for RAG': 'mean',
        'Mean answer_relevancy for Vanilla': 'mean'}).round(4)
    
    category_stats_relevancy['Difference (RAG - Vanilla)'] = (category_stats_relevancy['Mean answer_relevancy for RAG'] -
                                                    category_stats_relevancy['Mean answer_relevancy for Vanilla']).round(4)
    category_stats_relevancy = category_stats_relevancy.sort_values('Mean answer_relevancy for RAG', ascending=False)
    display(category_stats_relevancy)

    
    # EVALUATE RAG TRIAD - FAITHFULNESS
    with open(RAG_FAITHFULNESS, 'r', encoding='utf-8') as f:
        data = json.load(f) 
    rag_faithfulness = pd.DataFrame(data)
    rag_faithfulness = rag_faithfulness[["Modified Questions", 
                "faithfulness for RAG run 1",
                "faithfulness for RAG run 2",
                "faithfulness for RAG run 3",
                "Mean faithfulness for RAG"]]
    
    merged_df = pd.merge(merged_df, rag_faithfulness, on="Modified Questions", how="inner")
    
    print("\n" + "=" * 80 + "\n")
    print("RAG Triad - Faithfulness Results:")
    faithfulness_mean = merged_df['Mean faithfulness for RAG'].mean()
    print(f"Overall Mean Faithfulness for RAG: {faithfulness_mean:.4f}")

    # Group by psychiatric category
    category_stats_faithfulness = merged_df.groupby('psychiatric_category').agg({
        'Mean faithfulness for RAG': 'mean',
    }).round(4)
    category_stats_faithfulness = category_stats_faithfulness.sort_values('Mean faithfulness for RAG', ascending=False)
    display(category_stats_faithfulness)

    
    # EVALUATE RAG TRIAD - CONTEXT PRECISION
    with open(RAG_CONTEXT_PRECISION, 'r', encoding='utf-8') as f:
        data = json.load(f) 
    rag_context_precision = pd.DataFrame(data)
    rag_context_precision = rag_context_precision[["Modified Questions",
                                                    "context_precision for RAG run 1",
                                                    "context_precision for RAG run 2",
                                                    "context_precision for RAG run 3",
                                                    "Mean context_precision for RAG"]]
    merged_df = pd.merge(merged_df, rag_context_precision, on="Modified Questions", how="inner")
    print("\n" + "=" * 80 + "\n")
    print("RAG Triad - Context Precision Results:") 
    context_precision_mean = merged_df['Mean context_precision for RAG'].mean()
    print(f"Overall Mean Context Precision for RAG: {context_precision_mean:.4f}")
    # Group by psychiatric category
    category_stats_context_precision = merged_df.groupby('psychiatric_category').agg({
        'Mean context_precision for RAG': 'mean',
    }).round(4) 
    category_stats_context_precision = category_stats_context_precision.sort_values('Mean context_precision for RAG', ascending=False)
    display(category_stats_context_precision)   


    






In [None]:
process_files("Llama4_maverick/test_dataset_together_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8_top5_answered.json")



Evaluation of Llama4_maverick/test_dataset_together_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8_top5_answered Results:

Answer Semantic Similarity Results:
Overall Mean Answer Semantic Similarity for RAG: 0.8767
Overall Mean Answer Semantic Similarity for Vanilla: 0.8776
Difference (RAG - Vanilla): -0.0009


| psychiatric_category | Vanilla_Mean | RAG_Mean | Difference (RAG - Vanilla) |
|---|---|---|---|
| Dissociative Disorders | 0.9294 | 0.9477 | 0.0183 |
| Somatic Disorders | 0.8613 | 0.8759 | 0.0146 |
| Eating Disorders | 0.8502 | 0.8636 | 0.0134 |
| Personality Disorders | 0.8908 | 0.8979 | 0.0071 |
| Bipolar Disorders | 0.8757 | 0.8806 | 0.0049 |
| Trauma and Stressor Related Disorders | 0.8965 | 0.8978 | 0.0013 |
| Depressive Disorders | 0.8706 | 0.869 | -0.0016 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.8795 | 0.8764 | -0.0031 |
| Other Mental Disorders | 0.874 | 0.8705 | -0.0035 |
| Anxiety Disorders | 0.8935 | 0.8827 | -0.0108 |




Answer Relevancy Results:
Overall Mean Answer Relevancy for Vanilla: 0.7004
Overall Mean Relevancy for RAG: 0.6472


| psychiatric_category | Mean answer_relevancy for RAG | Mean answer_relevancy for Vanilla | Difference (RAG - Vanilla) |
|---|---|---|---|
| Anxiety Disorders | 0.6761 | 0.7365 | -0.0604 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.6674 | 0.6756 | -0.0082 |
| Bipolar Disorders | 0.6562 | 0.6977 | -0.0415 |
| Obsessive-Compulsive Disorders | 0.6539 | 0.732 | -0.0781 |
| Depressive Disorders | 0.6477 | 0.7131 | -0.0654 |
| Trauma and Stressor Related Disorders | 0.6474 | 0.7069 | -0.0595 |
| Other Mental Disorders | 0.64 | 0.707 | -0.067 |
| Personality Disorders | 0.6304 | 0.6911 | -0.0607 |
| Somatic Disorders | 0.6115 | 0.6747 | -0.0632 |
| Eating Disorders | 0.6051 | 0.6461 | -0.041 |




RAG Triad - Faithfulness Results:
Overall Mean Faithfulness for RAG: 0.4251


| psychiatric_category | Mean faithfulness for RAG |
|---|---|
| Bipolar Disorders | 0.5307 |
| Depressive Disorders | 0.524 |
| Eating Disorders | 0.5057 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.4912 |
| Anxiety Disorders | 0.4764 |
| Obsessive-Compulsive Disorders | 0.393 |
| Trauma and Stressor Related Disorders | 0.3893 |
| Personality Disorders | 0.3597 |
| Other Mental Disorders | 0.3241 |
| Dissociative Disorders | 0.3077 |




RAG Triad - Context Precision Results:
Overall Mean Context Precision for RAG: 0.4648


| psychiatric_category | Mean context_precision for RAG |
|---|---|
| Dissociative Disorders | 1.0 |
| Eating Disorders | 0.8205 |
| Obsessive-Compulsive Disorders | 0.8028 |
| Bipolar Disorders | 0.564 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.5585 |
| Personality Disorders | 0.5435 |
| Anxiety Disorders | 0.4886 |
| Depressive Disorders | 0.4776 |
| Trauma and Stressor Related Disorders | 0.4201 |
| Other Mental Disorders | 0.3241 |


In [None]:
process_files('gemma_3b/test_dataset_together_google_gemma-3n-E4B-it_top5_answered.json')



Evaluation of gemma_3b/test_dataset_together_google_gemma-3n-E4B-it_top5_answered Results:

Answer Semantic Similarity Results:
Overall Mean Answer Semantic Similarity for RAG: 0.8626
Overall Mean Answer Semantic Similarity for Vanilla: 0.8676
Difference (RAG - Vanilla): -0.0050


| psychiatric_category | Vanilla_Mean | RAG_Mean | Difference (RAG - Vanilla) |
|---|---|---|---|
| Eating Disorders | 0.8309 | 0.8633 | 0.0324 |
| Obsessive-Compulsive Disorders | 0.8658 | 0.8803 | 0.0145 |
| Somatic Disorders | 0.8514 | 0.8566 | 0.0052 |
| Personality Disorders | 0.876 | 0.8808 | 0.0048 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.8664 | 0.8665 | 0.0001 |
| Anxiety Disorders | 0.8784 | 0.8777 | -0.0007 |
| Bipolar Disorders | 0.8671 | 0.8648 | -0.0023 |
| Trauma and Stressor Related Disorders | 0.8863 | 0.8799 | -0.0064 |
| Depressive Disorders | 0.8637 | 0.8546 | -0.0091 |
| Other Mental Disorders | 0.8678 | 0.8527 | -0.0151 |




Answer Relevancy Results:
Overall Mean Answer Relevancy for Vanilla: 0.6818
Overall Mean Relevancy for RAG: 0.5792


| psychiatric_category | Mean answer_relevancy for RAG | Mean answer_relevancy for Vanilla | Difference (RAG - Vanilla) |
|---|---|---|---|
| Obsessive-Compulsive Disorders | 0.7147 | 0.731 | -0.0163 |
| Eating Disorders | 0.684 | 0.7189 | -0.0349 |
| Trauma and Stressor Related Disorders | 0.6472 | 0.7019 | -0.0547 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.6454 | 0.6593 | -0.0139 |
| Anxiety Disorders | 0.6266 | 0.7313 | -0.1047 |
| Bipolar Disorders | 0.611 | 0.7126 | -0.1016 |
| Personality Disorders | 0.576 | 0.5449 | 0.0311 |
| Depressive Disorders | 0.559 | 0.6867 | -0.1277 |
| Dissociative Disorders | 0.5508 | 0.5508 | 0.0 |
| Other Mental Disorders | 0.5118 | 0.6839 | -0.1721 |




RAG Triad - Faithfulness Results:
Overall Mean Faithfulness for RAG: 0.5077


| psychiatric_category | Mean faithfulness for RAG |
|---|---|
| Bipolar Disorders | 0.5411 |
| Anxiety Disorders | 0.538 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.5358 |
| Depressive Disorders | 0.5313 |
| Eating Disorders | 0.5296 |
| Obsessive-Compulsive Disorders | 0.5222 |
| Dissociative Disorders | 0.5208 |
| Other Mental Disorders | 0.5164 |
| Personality Disorders | 0.4375 |
| Somatic Disorders | 0.3796 |




RAG Triad - Context Precision Results:
Overall Mean Context Precision for RAG: 0.2998


| psychiatric_category | Mean context_precision for RAG |
|---|---|
| Dissociative Disorders | 1.0 |
| Obsessive-Compulsive Disorders | 0.5833 |
| Eating Disorders | 0.5769 |
| Schizophrenia Spectrum and Other Psychotic Disorders | 0.4872 |
| Bipolar Disorders | 0.3872 |
| Depressive Disorders | 0.3062 |
| Anxiety Disorders | 0.2121 |
| Other Mental Disorders | 0.2025 |
| Personality Disorders | 0.1848 |
| Trauma and Stressor Related Disorders | 0.184 |


In [None]:
import scipy.stats as stats
import numpy as np

def calculate_confidence_interval(values, confidence=0.95):
    """
    Calculate confidence interval for a series of values.
    
    Args:
        values: pandas Series or list of values
        confidence: confidence level (default 0.95 for 95% CI)
    
    Returns:
        tuple: (mean, lower_bound, upper_bound, margin_of_error)
    """
    n = len(values)
    mean = np.mean(values)
    std_err = stats.sem(values)  # Standard error of the mean
    
    # Calculate margin of error using t-distribution
    t_value = stats.t.ppf((1 + confidence) / 2, df=n-1)
    margin_of_error = t_value * std_err
    
    lower_bound = mean - margin_of_error
    upper_bound = mean + margin_of_error
    
    return mean, lower_bound, upper_bound, margin_of_error

def process_files_cis(file_name, model_name='Llama-4-Scout'):
    
    base_name = file_name.replace('.json', '')

    # Files with answer similarity results
    VANILLA_ANSWER_SIMILARITY = f"{base_name}_vanilla_answer_similarity_evaluated.json"
    RAG_ANSWER_SIMILARITY = f"{base_name}_rag_answer_similarity_evaluated.json"
    
    VANILLA_ANSWER_RELEVANCE = f"{base_name}_vanilla_answer_relevancy_evaluated.json"
    RAG_ANSWER_RELEVANCE = f"{base_name}_rag_answer_relevancy_evaluated.json"
    
    
    RAG_FAITHFULNESS = f"{base_name}_rag_faithfulness_evaluated.json"
    RAG_CONTEXT_PRECISION = f"{base_name}_rag_context_precision_evaluated.json"

    
    print("\n" + "=" * 80 + "\n")
    print(f"Evaluation of {model_name} Results:\n")

    # ANSWER SIMILARITY - BOTH VANILLA AND RAG
  
    with open(VANILLA_ANSWER_SIMILARITY, 'r', encoding='utf-8') as f:
        data = json.load(f) 
    vanilla_answer_similarity = pd.DataFrame(data)


    with open(RAG_ANSWER_SIMILARITY, 'r', encoding='utf-8') as f:
        data = json.load(f) 
    rag_answer_similarity = pd.DataFrame(data)
    rag_answer_similarity = rag_answer_similarity[["Modified Questions", 
            "Answer Semantic Similarity for rag"]]
    
    merged_df = pd.merge(vanilla_answer_similarity, rag_answer_similarity, on="Modified Questions", how="inner")
    

    # By category
    print("\nBy Psychiatric Category:")
    category_sim_results = []
    
    for category in merged_df['psychiatric_category'].unique():
        category_data = merged_df[merged_df['psychiatric_category'] == category]
        
        vanilla_cat_ci = calculate_confidence_interval(category_data['Answer Semantic Similarity for vanilla'])
        rag_cat_ci = calculate_confidence_interval(category_data['Answer Semantic Similarity for rag'])
        
        # Paired t-test for this category
        cat_t_stat, cat_p_value = stats.ttest_rel(category_data['Answer Semantic Similarity for rag'],
                                                 category_data['Answer Semantic Similarity for vanilla'])
        
        category_sim_results.append({
            'Category': category,
            'N': len(category_data),
            'Vanilla_Mean': vanilla_cat_ci[0],
            'Vanilla_CI_Lower': vanilla_cat_ci[1],
            'Vanilla_CI_Upper': vanilla_cat_ci[2],
            'RAG_Mean': rag_cat_ci[0],
            'RAG_CI_Lower': rag_cat_ci[1],
            'RAG_CI_Upper': rag_cat_ci[2],
            'Difference': rag_cat_ci[0] - vanilla_cat_ci[0],
            'P_Value': cat_p_value,
            'Significant': 'Yes' if cat_p_value < 0.05 else 'No'
        })
    
    category_sim_df = pd.DataFrame(category_sim_results)
    category_sim_df = category_sim_df.sort_values('Difference', ascending=False)
    display(category_sim_df)




    
    # ANSWER RELEVANCY - BOTH VANILLA AND RAG

    with open(VANILLA_ANSWER_RELEVANCE, 'r', encoding='utf-8') as f:
        data = json.load(f)
    vanilla_answer_relevancy = pd.DataFrame(data)
    vanilla_answer_relevancy = vanilla_answer_relevancy[["Modified Questions",
                                                         "answer_relevancy for Vanilla run 1",
                                                         "answer_relevancy for Vanilla run 2",
                                                         "answer_relevancy for Vanilla run 3",
                                                         "Mean answer_relevancy for Vanilla"]]
    
    merged_df = pd.merge(merged_df, vanilla_answer_relevancy, on="Modified Questions", how="inner")

    with open(RAG_ANSWER_RELEVANCE, 'r', encoding='utf-8') as f:
        data = json.load(f)
    rag_answer_relevancy = pd.DataFrame(data)
    rag_answer_relevancy = rag_answer_relevancy[["Modified Questions",
                                                 "answer_relevancy for RAG run 1",
                                                 "answer_relevancy for RAG run 2",
                                                 "answer_relevancy for RAG run 3",
                                                 "Mean answer_relevancy for RAG"]]
    
    merged_df = pd.merge(merged_df, rag_answer_relevancy, on="Modified Questions", how="inner")
    
    # For Answer Relevancy - calculate CIs over the per-question means (averaged across the 3 runs)
    print("\n" + "=" * 80 + "\n")
    print("Answer Relevancy Results:")
    
    vanilla_means = merged_df['Mean answer_relevancy for Vanilla'].values
    rag_means = merged_df['Mean answer_relevancy for RAG'].values
    
    # Calculate overall CI across all questions
    overall_v_mean, overall_v_lower, overall_v_upper, overall_v_margin = calculate_confidence_interval(vanilla_means)
    overall_r_mean, overall_r_lower, overall_r_upper, overall_r_margin = calculate_confidence_interval(rag_means)
    
    print(f"Overall Mean Answer Relevancy for Vanilla: {overall_v_mean:.4f} [95% CI: {overall_v_lower:.4f} - {overall_v_upper:.4f}]")
    print(f"Overall Mean Answer Relevancy for RAG: {overall_r_mean:.4f} [95% CI: {overall_r_lower:.4f} - {overall_r_upper:.4f}]")
    
    # Paired t-test
    t_stat, p_value = stats.ttest_rel(rag_means, vanilla_means)
    print(f"Difference (RAG - Vanilla): {overall_r_mean - overall_v_mean:.4f} (p-value: {p_value:.4f})")
    
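    # Per-category Answer Relevancy: CIs and paired t-test on per-question means
    category_rel_results = []
    for category in merged_df['psychiatric_category'].unique():
        category_data = merged_df[merged_df['psychiatric_category'] == category]

        vanilla_cat_means = category_data['Mean answer_relevancy for Vanilla'].values
        rag_cat_means = category_data['Mean answer_relevancy for RAG'].values

        vanilla_cat_ci = calculate_confidence_interval(vanilla_cat_means)
        rag_cat_ci = calculate_confidence_interval(rag_cat_means)

        cat_t_stat, cat_p_value = stats.ttest_rel(rag_cat_means, vanilla_cat_means)

        # Store the per-category results so they can be reported later
        category_rel_results.append({
            'Category': category,
            'N': len(category_data),
            'Vanilla_Mean': vanilla_cat_ci[0],
            'RAG_Mean': rag_cat_ci[0],
            'Difference': rag_cat_ci[0] - vanilla_cat_ci[0],
            'P_Value': cat_p_value,
        })
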
    # Print the per-category semantic similarity breakdown (rows of category_sim_df)
    for _, row in category_sim_df.iterrows():
        print(f"{row['Category']} (n={row['N']}):")
        print(f"  Vanilla: {row['Vanilla_Mean']:.4f} [95% CI: {row['Vanilla_CI_Lower']:.4f} - {row['Vanilla_CI_Upper']:.4f}]")
        print(f"  RAG:     {row['RAG_Mean']:.4f} [95% CI: {row['RAG_CI_Lower']:.4f} - {row['RAG_CI_Upper']:.4f}]")
        print(f"  Difference: {row['Difference']:.4f} (p={row['P_Value']:.4f}) {'***' if row['Significant'] == 'Yes' else ''}")
        print()

    # ====== ANSWER RELEVANCY WITH CIS BY CATEGORY ======
    print("\n" + "=" * 80 + "\n")
    print("Answer Relevancy Results by Category:")
    
    # Define run columns
    vanilla_relevancy_runs = ['answer_relevancy for Vanilla run 1', 'answer_relevancy for Vanilla run 2', 'answer_relevancy for Vanilla run 3']
    rag_relevancy_runs = ['answer_relevancy for RAG run 1', 'answer_relevancy for RAG run 2', 'answer_relevancy for RAG run 3']
    

In [None]:
process_files_cis("Llama4_maverick/test_dataset_together_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8_top5_answered.json")



Evaluation of Llama-4-Scout Results:



Answer Relevancy Results:
Overall Mean Answer Relevancy for Vanilla: 0.7004 [95% CI: 0.6910 - 0.7098]
Overall Mean Answer Relevancy for RAG: 0.6472 [95% CI: 0.6292 - 0.6651]
Difference (RAG - Vanilla): -0.0532 (p-value: 0.0000)
*** Statistically significant difference at α = 0.05 ***

By Psychiatric Category:
Dissociative Disorders (n=1):
  Vanilla: 0.9294 [95% CI: nan - nan]
  RAG:     0.9477 [95% CI: nan - nan]
  Difference: 0.0182 (p=nan) 

Somatic Disorders (n=10):
  Vanilla: 0.8613 [95% CI: 0.8141 - 0.9084]
  RAG:     0.8759 [95% CI: 0.8384 - 0.9133]
  Difference: 0.0146 (p=0.2440) 

Eating Disorders (n=13):
  Vanilla: 0.8502 [95% CI: 0.8223 - 0.8782]
  RAG:     0.8636 [95% CI: 0.8371 - 0.8900]
  Difference: 0.0134 (p=0.3196) 

Personality Disorders (n=23):
  Vanilla: 0.8908 [95% CI: 0.8707 - 0.9109]
  RAG:     0.8979 [95% CI: 0.8778 - 0.9181]
  Difference: 0.0071 (p=0.1449) 

Bipolar Disorders (n=33):
  Vanilla: 0.8757 [95% CI: 0.8639 - 0

  std_err = stats.sem(values)  # Standard error of the mean
  var *= np.divide(n, n-ddof)  # to avoid error on division by zero
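
The `nan` bounds reported for a category with a single question (for example Dissociative Disorders, n=1) and the warning lines above appear to come from the same cause: the standard error and the paired t-test are undefined when there are fewer than two observations. The sketch below shows one way such cases could be guarded. It assumes a t-distribution-based interval and the 4-tuple return shape `(mean, lower, upper, margin)` used when unpacking `calculate_confidence_interval`. The names `safe_confidence_interval` and `safe_paired_ttest` are hypothetical and are not the notebook's own helpers, whose definitions are not shown in this part.

```python
import numpy as np
from scipy import stats


def safe_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper, margin); bounds are NaN when n < 2.

    Hypothetical helper: assumes a t-distribution-based interval.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    if n == 0:
        return (np.nan, np.nan, np.nan, np.nan)
    mean = float(values.mean())
    if n < 2:
        # The standard error is undefined for a single observation,
        # so report the mean and leave the bounds as NaN without calling sem().
        return (mean, np.nan, np.nan, np.nan)
    std_err = stats.sem(values)  # sample standard error (ddof=1)
    margin = float(std_err * stats.t.ppf((1 + confidence) / 2, df=n - 1))
    return (mean, mean - margin, mean + margin, margin)


def safe_paired_ttest(rag_scores, vanilla_scores):
    """Paired t-test returning (nan, nan) when fewer than two pairs exist."""
    rag_scores = np.asarray(rag_scores, dtype=float)
    vanilla_scores = np.asarray(vanilla_scores, dtype=float)
    if rag_scores.size < 2:
        return (np.nan, np.nan)
    result = stats.ttest_rel(rag_scores, vanilla_scores)
    return (float(result.statistic), float(result.pvalue))
```

With guards like these, single-question categories would still show their mean and `nan` bounds, but without triggering the division-by-zero warnings.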
