# Choose Model gpt-4o-mini

In [1]:
import torch
print("CUDA Available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    # Querying the device name raises an error when no GPU is present
    print("CUDA Device Name: ", torch.cuda.get_device_name(0))
    torch.cuda.empty_cache()

# Check whether CUDA is available to speed up processing
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

CUDA Available:  True
CUDA Device Name:  NVIDIA GeForce RTX 3050 Ti Laptop GPU
Using device: cuda


## GPT-4o-mini

In [2]:
from openai import OpenAI

In [4]:
# completion = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[
#         {"role": "user", "content": "hello?"}
#     ]
# )


In [13]:
# # Test gpt-4o-mini
# response = completion.choices[0].message.content
# print(response)

Hello! How can I assist you today?


# Dataset TeleQnA for Inference

In [3]:
import json

# Path to the JSON file with the processed TeleQnA questions
rel17_100_questions_path = r"../Files/rel17_100_questions.json"

# Load the TeleQnA questions (Release 17 only)
with open(rel17_100_questions_path, "r", encoding="utf-8") as file:
    rel17_100_questions = json.load(file)
print(len(rel17_100_questions))

100


In [4]:
rel17_100_questions[0]

{'question': 'Which NGAP procedure is used for inter-system load balancing? [3GPP Release 17]',
 'option 1': 'eNB Configuration Transfer',
 'option 2': 'Downlink RAN Configuration Transfer',
 'option 3': 'Uplink RAN Configuration Transfer',
 'option 4': 'MME Configuration Transfer',
 'answer': 'option 3: Uplink RAN Configuration Transfer',
 'explanation': 'The NGAP procedure used for inter-system load balancing is Uplink RAN Configuration Transfer.',
 'category': 'Standards overview'}

# Accuracy Evaluation

## Create the prompt and ask function for GPT-4o-mini

In [59]:

def ask_gpt4(question_data):
    """
    Function to generate an answer using the GPT-4o-mini model based on the given question and options.

    Parameters:
    - question_data: Dictionary containing the question and options.

    Returns:
    - String: Model's generated response.
    """
    # Initialize the OpenAI client
    client = OpenAI()

    # Extract question and options
    question = question_data['question']
    options = [f"{key}: {value}" for key, value in question_data.items() if 'option' in key]

    # Create the prompt
    prompt = (
        f"Question: {question}\n"
        f"Options:\n" + "\n".join(options) + "\n"
        "Answer with the correct option in the format 'correct option: <X>'."
    )

    # Generate the response using GPT-4o-mini
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
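        temperature=0.7,  # Sampling temperature; higher values give more random output
        max_tokens=512,   # Upper bound on the length of the response
        top_p=0.9,        # Nucleus sampling threshold
        frequency_penalty=0,  # 0 disables the penalty on frequently repeated tokens
        presence_penalty=0  # 0 disables the penalty on tokens that already appeared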
    )

    # Extract and return the generated response
    response = completion.choices[0].message.content.strip()
    return response
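
The prompt layout can be checked without calling the API. The sketch below pulls the string-building step out of `ask_gpt4` into a helper named `build_prompt`; that name is hypothetical and does not appear in the notebook. It selects keys with `startswith("option")` rather than `'option' in key`, which is an assumption about how the TeleQnA records are keyed.

```python
def build_prompt(question_data):
    """Return the multiple-choice prompt text for one TeleQnA-style record."""
    # Keep only the 'option N' entries, in the order they appear in the record.
    options = [
        f"{key}: {value}"
        for key, value in question_data.items()
        if key.startswith("option")
    ]
    return (
        f"Question: {question_data['question']}\n"
        "Options:\n" + "\n".join(options) + "\n"
        "Answer with the correct option in the format 'correct option: <X>'."
    )


sample = {
    "question": "Which physical channel informs the UE and the RN about the "
                "number of OFDM symbols used for the PDCCHs? [3GPP Release 17]",
    "option 1": "PBCH",
    "option 2": "PCFICH",
    "option 3": "PDSCH",
    "option 4": "PHICH",
    "answer": "option 2: PCFICH",
}
prompt = build_prompt(sample)
print(prompt)
```

Printing the prompt this way makes it easy to confirm that the `answer` and `explanation` fields never leak into the text sent to the model.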

In [21]:
question_data = {
    'question': 'Which physical channel informs the UE and the RN about the number of OFDM symbols used for the PDCCHs? [3GPP Release 17]',
    'option 1': 'PBCH',
    'option 2': 'PCFICH',
    'option 3': 'PDSCH',
    'option 4': 'PHICH',
    'answer': 'option 2: PCFICH',
    'explanation': 'The physical control format indicator channel (PCFICH) informs the UE and the RN about the number of OFDM symbols used for the PDCCHs.',
    'category': 'Standards specifications'
}

gpt4_response = ask_gpt4(question_data)
print(gpt4_response)

In release 17 of the 3GPP specifications, different physical channels serve various purposes in communicating information within the LTE (or NR) network. To answer the question regarding which physical channel informs the UE (User Equipment) and the RN (Relay Node) about the number of OFDM symbols used for the PDCCH (Physical Downlink Control Channel):

1. PBCH (Physical Broadcast Channel): This channel carries system information. However, it does not provide information regarding the number of OFDM symbols for PDCCH.

2. PCFICH (Physical Control Format Indicator Channel): This channel specifically informs the UE about the number of OFDM symbols used for PDCCHs in a given subframe. Thus, it fits the requirement perfectly.

3. PDSCH (Physical Downlink Shared Channel): While it is responsible for broadcast data transmission, it does not convey control information related to the OFDM symbols.

4. PHICH (Physical Harq Indicator Channel): This one contains acknowledgements/indications but 

## Create Function to Evaluate Question 

In [38]:
import re

def extract_option(answer):
    """
    Extract the option part from the answer string, removing all punctuation and converting to lowercase.
    
    Parameters:
    - answer: A string containing the answer in the format 'option X: ...'.

    Returns:
    - String: Extracted option (e.g., 'option 2'), or None if no match is found.
    """
    # Remove all punctuation and convert to lowercase
    cleaned_answer = re.sub(r'[^\w\s]', '', answer.lower())
    # Search for the option in the format "option X"
    match = re.search(r'option \d+', cleaned_answer)
    return match.group(0).strip() if match else None
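
Because `re.search` returns the first match, `extract_option` picks up the first "option N" in the text. In a verbose answer that discusses option 1 before concluding with option 3, that would be option 1. One possible hardening, shown here as a sketch, is to look for the requested `correct option:` marker first and fall back to the first match only when the marker is missing. The function name `extract_option_strict` is hypothetical.

```python
import re


def extract_option_strict(answer):
    """Prefer an explicit 'correct option: ...' marker; else use the first 'option N'."""
    text = answer.lower()
    # Requested format: 'correct option: 2' or 'correct option: option 2'.
    marked = re.search(r"correct option:\s*(?:option\s*)?(\d+)", text)
    if marked:
        return f"option {marked.group(1)}"
    # Fallback: the first 'option N' anywhere in the text.
    loose = re.search(r"option\s*(\d+)", text)
    return f"option {loose.group(1)}" if loose else None


print(extract_option_strict("correct option: option 2"))
print(extract_option_strict("Option 1 is wrong because ... correct option: 3"))
print(extract_option_strict("I am not sure."))
```

The return values keep the `'option N'` shape, so they can be compared directly with the option extracted from the `answer` field.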

In [39]:
def evaluate_model_response(model_response, question_data):
    """
    Compare the model's response with the correct answer from the question data.
    
    Parameters:
    - model_response: The response string generated by the model.
    - question_data: Dictionary containing the question, options, and the correct answer.

    Returns:
    - 1 if the response is correct; otherwise the option extracted from the model response (None if no option was found).
    """
    correct_option = extract_option(question_data['answer'])  # Extract correct option
    model_option = extract_option(model_response)  # Extract model's option
    # print(model_option, correct_option)

    return 1 if model_option == correct_option else model_option  # Return 1 if correct, else model's option


In [40]:
question_data = {
    'question': 'Which physical channel informs the UE and the RN about the number of OFDM symbols used for the PDCCHs? [3GPP Release 17]',
    'option 1': 'PBCH',
    'option 2': 'PCFICH',
    'option 3': 'PDSCH',
    'option 4': 'PHICH',
    'answer': 'option 2: PCFICH',
    'explanation': 'The physical control format indicator channel (PCFICH) informs the UE and the RN about the number of OFDM symbols used for the PDCCHs.',
    'category': 'Standards specifications'
}

In [41]:
evaluation_result = evaluate_model_response(gpt4_response, question_data)
print(evaluation_result)

1


## Ask GPT-4o-mini the 100 TeleQnA questions

In [60]:
def gpt4_evaluate_questions(questions):
    """
    Process all questions and return the model responses.
    
    Parameters:
    - questions: List of dictionaries containing question data, where each dictionary has:
        - 'question': A string representing the question to be asked to the model.
        - 'option N': The candidate answers shown to the model.
        - 'answer': A string representing the correct answer format (e.g., 'option 2: PCFICH').
    
    Returns:
    - List: A list of dictionaries where each dictionary contains:
        - 'question': The question as a string.
        - 'answer': The correct answer as a string.
        - 'response': The model's generated response for that question.
    """
    
    responses = []
    total_questions = len(questions)
    
    for idx, question_data in enumerate(questions):
        response = ask_gpt4(question_data)
        responses.append({
            "question": question_data['question'],
            "answer": question_data['answer'],
            "response": response
        })
        
        # Print progress
        print(f"Responded {idx + 1} of {total_questions} questions...")

    return responses
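
The loop above makes one API call per question with no error handling, so a single failed request aborts the whole run. A minimal retry wrapper with exponential backoff is sketched below. `call_with_retries` and `flaky_ask` are hypothetical names; `flaky_ask` stands in for `ask_gpt4` so the example runs offline, and catching every `Exception` is a simplification.

```python
import time


def call_with_retries(ask, question_data, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call ask(question_data), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ask(question_data)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # Give up after the last attempt.
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)


# Stand-in for ask_gpt4 that fails twice and then succeeds.
calls = {"n": 0}


def flaky_ask(question_data):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient error")
    return "correct option: option 2"


# The sleep function is replaced so the demo does not actually wait.
result = call_with_retries(flaky_ask, {"question": "demo"}, sleep=lambda seconds: None)
print(result)
```

Inside `gpt4_evaluate_questions`, the call `ask_gpt4(question_data)` could be swapped for `call_with_retries(ask_gpt4, question_data)` if this behaviour is wanted.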

In [66]:
# Process all questions and get responses
gpt4_responses = gpt4_evaluate_questions(rel17_100_questions)

Responded 1 of 100 questions...
Responded 2 of 100 questions...
Responded 3 of 100 questions...
Responded 4 of 100 questions...
Responded 5 of 100 questions...
Responded 6 of 100 questions...
Responded 7 of 100 questions...
Responded 8 of 100 questions...
Responded 9 of 100 questions...
Responded 10 of 100 questions...
Responded 11 of 100 questions...
Responded 12 of 100 questions...
Responded 13 of 100 questions...
Responded 14 of 100 questions...
Responded 15 of 100 questions...
Responded 16 of 100 questions...
Responded 17 of 100 questions...
Responded 18 of 100 questions...
Responded 19 of 100 questions...
Responded 20 of 100 questions...
Responded 21 of 100 questions...
Responded 22 of 100 questions...
Responded 23 of 100 questions...
Responded 24 of 100 questions...
Responded 25 of 100 questions...
Responded 26 of 100 questions...
Responded 27 of 100 questions...
Responded 28 of 100 questions...
Responded 29 of 100 questions...
Responded 30 of 100 questions...
Responded 31 of 100

In [67]:
print(gpt4_responses[0]['response'])

correct option: option 2


## Save accuracy responses

In [68]:
def save_responses_to_json(responses, filename):
    """
    Save the model responses to a JSON file.
    
    Parameters:
    - responses: List of responses to save.
    - filename: Name of the JSON file.
    """
    
    with open(filename, "w", encoding="utf-8") as json_file:
        json.dump(responses, json_file, indent=4)

In [69]:
# save_responses_to_json(gpt4_responses,"../Models_responses/Accuracy/gpt4_responses.json")

## Evaluate responses from GPT-4o-mini

In [70]:
# Path to the JSON file with the saved GPT-4o-mini responses
gpt4_responses_path = r"../Models_responses/Accuracy/gpt4_responses.json"

# Load the saved responses
with open(gpt4_responses_path, "r", encoding="utf-8") as file:
    gpt4_responses = json.load(file)
print(len(gpt4_responses))

100


In [73]:
def evaluate_accuracy(model_responses):
    """
    Evaluate the model's responses and calculate accuracy.
    """
    correct_count = 0  # Track the number of correct responses
    none_count = 0  # Track the number of 'None' responses

    for index, question_data in enumerate(model_responses):
        evaluation_result = evaluate_model_response(question_data['response'], question_data)
        options = [f"{key}: {value}" for key, value in rel17_100_questions[index].items() if 'option' in key]

        if evaluation_result == 1:
            correct_count += 1  # Increment for correct response
        elif evaluation_result is None:
            # Print only responses that are None
            print("\nWrong Answer")
            print(f"Question {index + 1}: {question_data['question']}")
            print(f"Options:\n" + "\n".join(options) + "\n")
            print(f"Correct response: {question_data['answer']}")
            print(f"Full model response:\n{question_data['response']}")
            print("----------------------------------------------------------------------------------------")
            none_count += 1  # Increment for None response
        else:
            print("\nWrong Answer")
            print(f"Question {index + 1}: {question_data['question']}")
            print(f"Options:\n" + "\n".join(options) + "\n")
            print(f"Correct response: {question_data['answer']}")
            print(f"Model response: {evaluation_result}")
            print("----------------------------------------------------------------------------------------")

    # Calculate and print accuracy
    accuracy = correct_count / len(model_responses) * 100
    print(f"\nAccuracy: {accuracy:.2f}%")
    print(f"Total 'None' responses: {none_count}")
    print("A 'None' response means that the model did not name an option")
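
`evaluate_accuracy` prints its findings and depends on the global `rel17_100_questions`. When the numbers are needed elsewhere, the counting step can be separated from the printing. The sketch below assumes a list of values shaped like the return value of `evaluate_model_response` (1, `None`, or a wrong `'option N'`); `tally_results` is a hypothetical name.

```python
from collections import Counter


def tally_results(results):
    """Summarise results where each item is 1 (correct), None (no option) or a wrong 'option N'."""
    counts = Counter(
        "correct" if r == 1 else "no_option" if r is None else "wrong"
        for r in results
    )
    total = len(results)
    # Guard against an empty list instead of dividing by zero.
    accuracy = 100 * counts["correct"] / total if total else 0.0
    return {
        "accuracy": accuracy,
        "correct": counts["correct"],
        "wrong": counts["wrong"],
        "no_option": counts["no_option"],
    }


summary = tally_results([1, "option 2", None, 1])
print(summary)
```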


In [74]:
evaluate_accuracy(gpt4_responses)


Wrong Answer
Question 1: Which NGAP procedure is used for inter-system load balancing? [3GPP Release 17]
Options:
option 1: eNB Configuration Transfer
option 2: Downlink RAN Configuration Transfer
option 3: Uplink RAN Configuration Transfer
option 4: MME Configuration Transfer

Correct response: option 3: Uplink RAN Configuration Transfer
Model response: option 2
----------------------------------------------------------------------------------------

Wrong Answer
Question 7: What is the purpose of load-balancing steering mode enhancements? [3GPP Release 17]
Options:
option 1: To provide better network performance measurements
option 2: To prioritize non-3GPP access over 3GPP access in load balancing
option 3: To enable the UE and UPF to freely select split percentages for each access type
option 4: To enhance the functionality of the AMF

Correct response: option 3: To enable the UE and UPF to freely select split percentages for each access type
Model response: option 2
-------------

# RAGAS evaluation

## Create the prompt with no options and the ask function for GPT-4o-mini

In [5]:
def ask_gpt4_no_options(question_data):
    """
    Function to generate an answer using the GPT-4o-mini model based on the given question.

    Parameters:
    - question_data: Dictionary containing the question and options.

    Returns:
    - String: Model's generated response.
    """
    # Initialize the OpenAI client
    client = OpenAI()

    # Extract the question (the options are deliberately left out of this prompt)
    question = question_data['question']
    # options = [f"{key}: {value}" for key, value in question_data.items() if 'option' in key]

    # Create the prompt
    prompt = (
        f"Question: {question}\n"
        "Think step by step before answering and respond with a final answer in the format 'answer: <XXXXX>'."
    )

    # Generate the response using GPT-4o-mini
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
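        temperature=0.7,  # Sampling temperature; higher values give more random output
        max_tokens=512,   # Upper bound on the length of the response
        top_p=0.9,        # Nucleus sampling threshold
        frequency_penalty=0,  # 0 disables the penalty on frequently repeated tokens
        presence_penalty=0  # 0 disables the penalty on tokens that already appeared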
    )

    # Extract and return the generated response
    response = completion.choices[0].message.content.strip()
    return response

In [81]:
question_data = {
    'question': 'Which physical channel informs the UE and the RN about the number of OFDM symbols used for the PDCCHs? [3GPP Release 17]',
    'option 1': 'PBCH',
    'option 2': 'PCFICH',
    'option 3': 'PDSCH',
    'option 4': 'PHICH',
    'answer': 'option 2: PCFICH',
    'explanation': 'The physical control format indicator channel (PCFICH) informs the UE and the RN about the number of OFDM symbols used for the PDCCHs.',
    'category': 'Standards specifications'
}

gpt4_response_text = ask_gpt4_no_options(question_data)
print(gpt4_response_text)

To determine which physical channel informs the User Equipment (UE) and the Relay Node (RN) about the number of OFDM symbols used for the Physical Downlink Control Channel (PDCCH), we need to consider the specifications outlined in 3GPP Release 17.

1. **Understanding PDCCH**: The PDCCH is used to carry control information to the UE, including scheduling assignments, hybrid automatic repeat requests (HARQ), and other control messages.

2. **Role of Physical Channels**: In LTE and NR (New Radio), there are specific physical channels that convey important information about the configuration of the control channels.

3. **Physical Broadcast Channel (PBCH)**: The PBCH is a physical channel that carries system information, including information about the configuration of the PDCCH.

4. **PDCCH Configuration Information**: The number of OFDM symbols allocated for the PDCCH is typically included in the system information transmitted over the PBCH.

5. **Conclusion**: Based on this understandi

In [6]:
import string

def format_answer(answer):
    # Remove punctuation and convert to lowercase
    answer_no_punctuation = answer.translate(str.maketrans('', '', string.punctuation))
    return answer_no_punctuation.lower()

In [7]:
import re
import string

def extract_answer(response):
    """
    Extracts the answer from the model's response if it contains 'answer:'.
    If 'answer:' is not present, returns the entire response.

    Parameters:
    - response: String containing the model's generated response.

    Returns:
    - String: Formatted extracted answer or the full response formatted.
    """
    keyword = "answer:"

    # Check if the keyword exists in the response
    if keyword in response.lower():
        # Extract everything after 'answer:'
        extracted = response.lower().rsplit(keyword, 1)[1].strip()
    else:
        # Use the full response if 'answer:' is not found
        extracted = response.strip()

    # Format the extracted answer
    return format_answer(extracted)
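
A short worked example helps to see what these two helpers produce. The block below restates them in compact form so it runs on its own, then applies them to two made-up responses. The input strings are illustrative and were not produced by the model.

```python
import string


def format_answer(answer):
    # Remove punctuation and convert to lowercase.
    return answer.translate(str.maketrans("", "", string.punctuation)).lower()


def extract_answer(response):
    keyword = "answer:"
    lowered = response.lower()
    if keyword in lowered:
        # Keep only the text after the last 'answer:' marker.
        extracted = lowered.rsplit(keyword, 1)[1].strip()
    else:
        extracted = response.strip()
    return format_answer(extracted)


with_marker = extract_answer(
    "Step 1: ... Final answer: PCFICH (Physical Control Format Indicator Channel)."
)
without_marker = extract_answer("The PCFICH carries it.")
hyphenated = format_answer("inter-system load balancing")

print(with_marker)
print(without_marker)
print(hyphenated)
```

Hyphens count as punctuation, so "inter-system" becomes "intersystem". This is why the formatted references later in the notebook contain merged words.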

In [None]:
extracted_answer = extract_answer(gpt4_response_text)
print(extracted_answer)

In [91]:
correct_answer = format_answer(question_data['explanation'])
print(correct_answer)

the physical control format indicator channel pcfich informs the ue and the rn about the number of ofdm symbols used for the pdcchs


## Model Groq for RAGAS evaluation

In [9]:
import getpass
import os

if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")

In [95]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    # model="llama-3.1-70b-versatile",
    model="llama3-70b-8192",
    # model="llama3-groq-70b-8192-tool-use-preview",
    temperature=0.7,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [26]:
# from langchain_ollama import ChatOllama

# llm = ChatOllama(
#     model = "llama3.1",
#     temperature = 0.8,
#     num_predict = 256,
#     # other params ...
# )

In [11]:
llm.invoke("Hello")

AIMessage(content='Hello. How can I assist you today?', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 10, 'prompt_tokens': 36, 'total_tokens': 46, 'completion_time': 0.04, 'prompt_time': 0.009800944, 'queue_time': 0.005168585999999999, 'total_time': 0.049800944}, 'model_name': 'llama-3.1-70b-versatile', 'system_fingerprint': 'fp_b3ae7e594e', 'finish_reason': 'stop', 'logprobs': None}, id='run-78d21cf9-110d-4491-9477-6f4d0302e143-0', usage_metadata={'input_tokens': 36, 'output_tokens': 10, 'total_tokens': 46})

In [12]:
from langchain.embeddings import HuggingFaceEmbeddings
# from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


## Ask GPT-4o-mini the 100 TeleQnA questions with no options

In [100]:
def evaluate_questions_no_options(questions):
    """
    Process all questions and return the model responses.
    
    Parameters:
    - questions: List of dictionaries containing question data, where each dictionary has:
        - 'question': A string representing the question to be asked to the model.
        - 'explanation': A string with the reference explanation, stored as the expected answer.
    
    Returns:
    - List: A list of dictionaries where each dictionary contains:
        - 'question': The question as a string.
        - 'answer': The reference explanation as a string.
        - 'response': The model's generated response for that question.
    """
    
    responses = []
    total_questions = len(questions)
    
    for idx, question_data in enumerate(questions):
        response = ask_gpt4_no_options(question_data)
        responses.append({
            "question": question_data['question'],
            "answer": question_data['explanation'],
            "response": response
        })
        
        # Print progress
        print(f"Responded {idx + 1} of {total_questions} questions...")

    return responses

In [101]:
# Process all questions and get responses
gpt4_responses_RAGAS = evaluate_questions_no_options(rel17_100_questions)

Responded 1 of 100 questions...
Responded 2 of 100 questions...
Responded 3 of 100 questions...
Responded 4 of 100 questions...
Responded 5 of 100 questions...
Responded 6 of 100 questions...
Responded 7 of 100 questions...
Responded 8 of 100 questions...
Responded 9 of 100 questions...
Responded 10 of 100 questions...
Responded 11 of 100 questions...
Responded 12 of 100 questions...
Responded 13 of 100 questions...
Responded 14 of 100 questions...
Responded 15 of 100 questions...
Responded 16 of 100 questions...
Responded 17 of 100 questions...
Responded 18 of 100 questions...
Responded 19 of 100 questions...
Responded 20 of 100 questions...
Responded 21 of 100 questions...
Responded 22 of 100 questions...
Responded 23 of 100 questions...
Responded 24 of 100 questions...
Responded 25 of 100 questions...
Responded 26 of 100 questions...
Responded 27 of 100 questions...
Responded 28 of 100 questions...
Responded 29 of 100 questions...
Responded 30 of 100 questions...
Responded 31 of 100

In [104]:
print(gpt4_responses_RAGAS[0]['question'])
print(extract_answer(gpt4_responses_RAGAS[0]['response']))
print(gpt4_responses_RAGAS[0]['answer'])

Which NGAP procedure is used for inter-system load balancing? [3GPP Release 17]
load balancing request
The NGAP procedure used for inter-system load balancing is Uplink RAN Configuration Transfer.


In [105]:
# save_responses_to_json(gpt4_responses_RAGAS,"../Models_responses/RAGAS/gpt4_responses_RAGAS.json")

## Build Dataset for Evaluation with RAGAS

In [86]:
# Path to the JSON file with the saved GPT-4o-mini responses (no options)
gpt4_responses_RAGAS_path = r"../Models_responses/RAGAS/gpt4_responses_RAGAS.json"

# Load the saved responses
with open(gpt4_responses_RAGAS_path, "r", encoding="utf-8") as file:
    gpt4_responses_RAGAS = json.load(file)
print(len(gpt4_responses_RAGAS))

100


In [87]:
from datasets import Dataset 

In [88]:
def transform_dataset(data):
    """Transform the dataset to the required format."""
    transformed_data = {
        'user_input': [],
        'response': [],
        'reference': []
    }

    for item in data:
        # print(f"\n{item['question']}\n{item['answer']}\n{item['response']}")
        question = item['question']
        model_response = format_answer(extract_answer(item['response']))
        correct_answer = format_answer(item['answer'])

        transformed_data['user_input'].append(question)
        transformed_data['response'].append(model_response)
        transformed_data['reference'].append(correct_answer)

    return transformed_data

In [89]:
# Transform the responses dataset (first 20 responses only)
data_samples = transform_dataset(gpt4_responses_RAGAS[:20])
# data_samples = transform_dataset(gpt4_responses_RAGAS)

# Create the dataset object
dataset = Dataset.from_dict(data_samples)

# Print to verify the structure
print(dataset)

Dataset({
    features: ['user_input', 'response', 'reference'],
    num_rows: 20
})


In [90]:
dataset[0]

{'user_input': 'Which NGAP procedure is used for inter-system load balancing? [3GPP Release 17]',
 'response': 'load balancing request',
 'reference': 'the ngap procedure used for intersystem load balancing is uplink ran configuration transfer'}

## Evaluate GPT-4o-mini with RAGAS Metrics

### Using LLM to evaluate (Factual Correctness, Semantic similarity and Rubrics based criteria scoring)

In [91]:
from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics._factual_correctness import FactualCorrectness
from ragas.metrics import SemanticSimilarity
from ragas.metrics import RubricsScoreWithReference

In [92]:
factualCorrectness = FactualCorrectness()
semantiSimilarity = SemanticSimilarity()
rubrics = {
    "score1_description": "The response is incorrect, irrelevant, or does not align with the ground truth.",
    "score2_description": "The response partially matches the ground truth but includes significant errors, omissions, or irrelevant information.",
    "score3_description": "The response generally aligns with the ground truth but may lack detail, clarity, or have minor inaccuracies.",
    "score4_description": "The response is mostly accurate and aligns well with the ground truth, with only minor issues or missing details.",
    "score5_description": "The response is fully accurate, aligns completely with the ground truth, and is clear and detailed.",
}
rubricsScoreWithReference =  RubricsScoreWithReference(rubrics=rubrics)

In [96]:
score = evaluate(
    dataset,
    metrics=[
        factualCorrectness,
        semantiSimilarity,
        rubricsScoreWithReference,
    ],
    llm=llm,
    embeddings=embeddings,
    run_config = RunConfig(timeout=600, max_retries=20, max_wait=180,log_tenacity=False),
)
score.to_pandas()

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[9]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[54]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[0]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[6]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[30]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[39]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[51]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[42]: TimeoutError()


| | user_input | response | reference | factual_correctness | semantic_similarity | rubrics_score_with_reference |
|---|---|---|---|---|---|---|
| 0 | Which NGAP procedure is used for inter-system ... | load balancing request | the ngap procedure used for intersystem load b... | | 0.541854 | 1 |
| 1 | What is covered by enhanced application layer ... | enhanced application layer support for v2x ser... | enhanced application layer support for v2x ser... | | 0.823039 | 3 |
| 2 | What does the Load-Balancing steering mode do?... | loadbalancing steering mode optimizes ue distr... | the loadbalancing steering mode splits the tra... | | 0.521146 | 2 |
| 3 | What is the main objective of intent driven ma... | intentdriven management aims to simplify and a... | the intent driven management aims to reduce th... | | 0.717422 | 3 |
| 4 | What does MINT stand for? [3GPP Release 17] | mint | mint stands for minimization of service interr... | 0.67 | 0.635801 | 1 |
| 5 | What is the purpose of the Media Streaming AF ... | the purpose of the media streaming af event ex... | the work item relates to the support of generi... | 0.0 | 0.56753 | 2 |
| 6 | What is the purpose of load-balancing steering... | optimize network resource distribution and imp... | in rel17 loadbalancing steering mode enhanceme... | 0.33 | 0.313741 | 3 |
| 7 | What is a capability added in the V2X Applicat... | support for multiapplication use cases | v2x service discovery across multiple v2x serv... | 0.0 | 0.286697 | 2 |
| 8 | What is the purpose of the Edge Data Network (... | the edge data network edn supports edge applic... | the edge data network edn hosts the edge appli... | 0.29 | 0.792013 | 2 |
| 9 | What are the three features specified in TS 23... | direct communication capability discovery and ... | the three features specified in ts 23304 for 5... | 0.2 | 0.387232 | 2 |


In [97]:
score

{'factual_correctness': 0.3155, 'semantic_similarity': 0.6106, 'rubrics_score_with_reference': 2.1500}
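
Some `factual_correctness` cells in the table above are empty, which plausibly corresponds to the jobs that ended in `TimeoutError`. A quick way to see how much of a column was actually scored is to count the missing values and average the rest. The sketch uses only the ten rows displayed above, so its mean is not expected to match the aggregate over all 20 rows. Treating empty cells as NaN is an assumption.

```python
import math

# factual_correctness values copied from the ten displayed rows;
# NaN marks the rows where the cell is empty.
nan = float("nan")
factual = [nan, nan, nan, nan, 0.67, 0.0, 0.33, 0.0, 0.29, 0.2]

# Keep only the rows that received a score.
scored = [value for value in factual if not math.isnan(value)]
missing = len(factual) - len(scored)
mean_scored = sum(scored) / len(scored)

print(f"rows without a score: {missing}")
print(f"mean over scored rows: {mean_scored:.4f}")
```

If many rows are missing, the reported average rests on a small sample and is worth re-running with a longer timeout or fewer parallel jobs.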

In [98]:
gpt4_evaluation_RAGAS_LLM = score.to_pandas()
# gpt4_evaluation_RAGAS_LLM.to_csv("../Evaluations/RAGAS/gpt4_evaluation_RAGAS_LLM.csv", index=False)

In [99]:
import pandas as pd
result = pd.read_csv("../Evaluations/RAGAS/gpt4_evaluation_RAGAS_LLM.csv")

In [100]:
result

| | user_input | response | reference | factual_correctness | semantic_similarity | rubrics_score_with_reference |
|---|---|---|---|---|---|---|
| 0 | Which NGAP procedure is used for inter-system ... | load balancing request | the ngap procedure used for intersystem load b... | | 0.541854 | 1 |
| 1 | What is covered by enhanced application layer ... | enhanced application layer support for v2x ser... | enhanced application layer support for v2x ser... | | 0.823039 | 3 |
| 2 | What does the Load-Balancing steering mode do?... | loadbalancing steering mode optimizes ue distr... | the loadbalancing steering mode splits the tra... | | 0.521146 | 2 |
| 3 | What is the main objective of intent driven ma... | intentdriven management aims to simplify and a... | the intent driven management aims to reduce th... | | 0.717422 | 3 |
| 4 | What does MINT stand for? [3GPP Release 17] | mint | mint stands for minimization of service interr... | 0.67 | 0.635801 | 1 |
| 5 | What is the purpose of the Media Streaming AF ... | the purpose of the media streaming af event ex... | the work item relates to the support of generi... | 0.0 | 0.56753 | 2 |
| 6 | What is the purpose of load-balancing steering... | optimize network resource distribution and imp... | in rel17 loadbalancing steering mode enhanceme... | 0.33 | 0.313741 | 3 |
| 7 | What is a capability added in the V2X Applicat... | support for multiapplication use cases | v2x service discovery across multiple v2x serv... | 0.0 | 0.286697 | 2 |
| 8 | What is the purpose of the Edge Data Network (... | the edge data network edn supports edge applic... | the edge data network edn hosts the edge appli... | 0.29 | 0.792013 | 2 |
| 9 | What are the three features specified in TS 23... | direct communication capability discovery and ... | the three features specified in ts 23304 for 5... | 0.2 | 0.387232 | 2 |


### Metrics that do not need an LLM (BleuScore, RougeScore, ExactMatch and StringPresence)

In [78]:
from ragas.metrics import BleuScore, RougeScore, ExactMatch, StringPresence
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/arimatea/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [79]:
bleuScore = BleuScore()
rougeScore = RougeScore()
exactMatch = ExactMatch()
stringPresence = StringPresence()

In [80]:
score = evaluate(
    dataset,
    metrics=[
        bleuScore,
        rougeScore,
        exactMatch,
        stringPresence
    ],
    llm=llm,
    embeddings=embeddings
)
score.to_pandas()

Evaluating:   0%|          | 0/400 [00:00<?, ?it/s]

The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
ERROR:ragas.executor:Exception raised in Job[328]: AssertionError(Expecting a float)
ERROR:ragas.executor:Exception raised in Job[360]: AssertionError(Expecting a float)
ERROR:ragas.executor:Exception raised in Job[316]: AssertionError(Expecting a float)
ERROR:ragas.executor:Exception raised in Job[268]: AssertionError(Expe

| | user_input | response | reference | bleu_score | rouge_score | exact_match | string_present |
|---|---|---|---|---|---|---|---|
| 0 | Which NGAP procedure is used for inter-system ... | load balancing request | the ngap procedure used for intersystem load b... | 4.043371e-156 | 0.250000 | 0.0 | 0.0 |
| 1 | What is covered by enhanced application layer ... | enhanced application layer support for v2x ser... | enhanced application layer support for v2x ser... | 1.839873e-01 | 0.363636 | 0.0 | 0.0 |
| 2 | What does the Load-Balancing steering mode do?... | loadbalancing steering mode optimizes ue distr... | the loadbalancing steering mode splits the tra... | 1.886978e-78 | 0.240000 | 0.0 | 0.0 |
| 3 | What is the main objective of intent driven ma... | intentdriven management aims to simplify and a... | the intent driven management aims to reduce th... | 1.943368e-78 | 0.166667 | 0.0 | 0.0 |
| 4 | What does MINT stand for? [3GPP Release 17] | mint | mint stands for minimization of service interr... | 4.515870e-234 | 0.250000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Which RRC state is the UE in when no RRC conne... | rrc idle | when no rrc connection is established the ue i... | 1.032235e-233 | 0.142857 | 0.0 | 0.0 |
| 96 | How are the antenna elements placed on each an... | antenna elements are placed in configurations ... | the document states that the antenna elements ... | 2.061735e-01 | 0.410256 | 0.0 | 0.0 |
| 97 | What information may be provided to an emergen... | user location information user identity inform... | for emergency services the geographic location... | 3.297205e-232 | 0.133333 | 0.0 | 0.0 |
| 98 | What is the purpose of cross-network slice coo... | the purpose of crossnetwork slice coordination... | crossnetwork slice coordination enables the co... | 2.230469e-78 | 0.264151 | 0.0 | 0.0 |


In [81]:
score

{'bleu_score': 0.0721, 'rouge_score': 0.2385, 'exact_match': 0.0000, 'string_present': 0.0000}
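
The NLTK warnings above explain the near-zero BLEU values: BLEU combines n-gram precisions multiplicatively, so a response with no matching 3-grams or 4-grams scores essentially zero even when its individual words match. The sketch below computes clipped n-gram precision for the first row of the table. It is a simplified illustration on whitespace tokens with no brevity penalty or smoothing, not the implementation used by RAGAS or NLTK, and `ngram_precision` is a hypothetical helper.

```python
from collections import Counter


def ngram_precision(response, reference, n):
    """Clipped n-gram precision of response against reference (whitespace tokens)."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    resp, ref = ngrams(response), ngrams(reference)
    total = sum(resp.values())
    if total == 0:
        return 0.0  # The response is shorter than n tokens.
    # Each response n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in resp.items())
    return overlap / total


response = "load balancing request"
reference = ("the ngap procedure used for intersystem load balancing "
             "is uplink ran configuration transfer")

for n in (1, 2, 3):
    print(n, ngram_precision(response, reference, n))
```

Two of the three words and one of the two bigrams match, but the single trigram does not. Short answers compared against full-sentence references will tend to show this pattern, which also fits the zero `exact_match` scores.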

In [82]:
gpt4_evaluation_RAGAS_no_LLM = score.to_pandas()
# gpt4_evaluation_RAGAS_no_LLM.to_csv("../Evaluations/RAGAS/gpt4_evaluation_RAGAS_no_LLM.csv", index=False)

In [83]:
import pandas as pd
result = pd.read_csv("../Evaluations/RAGAS/gpt4_evaluation_RAGAS_no_LLM.csv")

In [84]:
result

| | user_input | response | reference | bleu_score | rouge_score | exact_match | string_present |
|---|---|---|---|---|---|---|---|
| 0 | Which NGAP procedure is used for inter-system ... | load balancing request | the ngap procedure used for intersystem load b... | 4.043371e-156 | 0.250000 | 0.0 | 0.0 |
| 1 | What is covered by enhanced application layer ... | enhanced application layer support for v2x ser... | enhanced application layer support for v2x ser... | 1.839873e-01 | 0.363636 | 0.0 | 0.0 |
| 2 | What does the Load-Balancing steering mode do?... | loadbalancing steering mode optimizes ue distr... | the loadbalancing steering mode splits the tra... | 1.886978e-78 | 0.240000 | 0.0 | 0.0 |
| 3 | What is the main objective of intent driven ma... | intentdriven management aims to simplify and a... | the intent driven management aims to reduce th... | 1.943368e-78 | 0.166667 | 0.0 | 0.0 |
| 4 | What does MINT stand for? [3GPP Release 17] | mint | mint stands for minimization of service interr... | 4.515870e-234 | 0.250000 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Which RRC state is the UE in when no RRC conne... | rrc idle | when no rrc connection is established the ue i... | 1.032235e-233 | 0.142857 | 0.0 | 0.0 |
| 96 | How are the antenna elements placed on each an... | antenna elements are placed in configurations ... | the document states that the antenna elements ... | 2.061735e-01 | 0.410256 | 0.0 | 0.0 |
| 97 | What information may be provided to an emergen... | user location information user identity inform... | for emergency services the geographic location... | 3.297205e-232 | 0.133333 | 0.0 | 0.0 |
| 98 | What is the purpose of cross-network slice coo... | the purpose of crossnetwork slice coordination... | crossnetwork slice coordination enables the co... | 2.230469e-78 | 0.264151 | 0.0 | 0.0 |
