### **Load jsonl from dataset directory**


In [1]:
import json
import os  

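def read_data(filename, dataset_dir="../PragmatiCQA/data"):
    """Read a JSONL file: one JSON object (one conversation) per line."""
    corpus = []
    with open(os.path.join(dataset_dir, filename), 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                corpus.append(json.loads(line))
    return corpus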

pcqa_test = read_data("test.jsonl")

In [2]:
print(f"Loaded {len(pcqa_test)} documents from PragmatiCQA test set.")

Loaded 213 documents from PragmatiCQA test set.


### **Pretty print the first document to check the structure**


In [3]:
import pprint   
pprint.pprint(pcqa_test[0])  # Print the first document to check the structure

{'community': 'The Legend of Zelda',
 'genre': 'Games',
 'qas': [{'a': 'The Legend of Zelda came out as early as 1986 for the Famicom '
               'in Japan, and was later released in the western world, '
               'including Europe and the US in 1987. Would you like to know '
               'about the story?',
          'a_meta': {'literal_obj': [{'endKey': '9cbccabd-66be-4a46-bd8b-f59a299c987d',
                                      'startKey': '1f4f808a-8560-4894-b892-15fa3c33887a',
                                      'text': 'FDS release February 21, '
                                              '1986\n'},
                                     {'endKey': '738bff65-b4f9-4660-bd18-79722ed67a40',
                                      'startKey': 'a0d9d5c5-18bb-4be4-825e-fca2900db18e',
                                      'text': 'The Legend of Zelda is the '
                                              'first installment of the Zelda '
                                   

In [4]:
print(f"Loaded {len(pcqa_test[0]['qas'])} questions from the first document.")

Loaded 6 questions from the first document.
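
To look beyond the first document, a small sketch can summarize how many question-answer turns each conversation holds. The `turn_counts` helper and the two-item `sample_docs` list are hypothetical stand-ins that only mirror the `qas` layout shown above; in the notebook one would pass `pcqa_test` instead.

```python
from statistics import mean

def turn_counts(docs):
    """Return the number of QA turns in each conversation."""
    return [len(doc.get("qas", [])) for doc in docs]

# Hypothetical stand-in data; replace with pcqa_test in the notebook.
sample_docs = [
    {"community": "A", "qas": [{"q": "q1", "a": "a1"}, {"q": "q2", "a": "a2"}]},
    {"community": "B", "qas": [{"q": "q1", "a": "a1"}]},
]

counts = turn_counts(sample_docs)
print(f"min={min(counts)} max={max(counts)} mean={mean(counts):.2f}")
```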


In [5]:
pprint.pprint(pcqa_test[0]['qas'][0])

{'a': 'The Legend of Zelda came out as early as 1986 for the Famicom in Japan, '
      'and was later released in the western world, including Europe and the '
      'US in 1987. Would you like to know about the story?',
 'a_meta': {'literal_obj': [{'endKey': '9cbccabd-66be-4a46-bd8b-f59a299c987d',
                             'startKey': '1f4f808a-8560-4894-b892-15fa3c33887a',
                             'text': 'FDS release February 21, 1986\n'},
                            {'endKey': '738bff65-b4f9-4660-bd18-79722ed67a40',
                             'startKey': 'a0d9d5c5-18bb-4be4-825e-fca2900db18e',
                             'text': 'The Legend of Zelda is the first '
                                     'installment of the Zelda series. '},
                            {'endKey': '738bff65-b4f9-4660-bd18-79722ed67a40',
                             'startKey': '738bff65-b4f9-4660-bd18-79722ed67a40',
                             'text': ' It centers its plot around a boy named 

# **4.2 Part 0: Dataset Analysis** 

## **Key motivations & contributions — Summary**

### *Motivation*: 
- Standard ConvQA benchmarks mainly check whether a system finds the exact (literal) answer. Real conversations, however, require understanding hidden intentions, giving extra helpful information, and handling differences in what the asker and the answerer know.

### *Contribution*:
1. A conversational question answering (ConvQA) dataset in the open domain that includes pragmatic answers and provides quantitative metrics for assessing pragmatic reasoning.
2. A crowdsourcing framework for PRAGMATICQA that addresses incentive misalignment, resulting in realistic, high-quality, and diverse data.
3. An analysis of the dataset that shows how unique and important its challenges are for current ConvQA systems.
 
### *Why it is important*:
- Most baseline models (like FiD and DPR) miss over 90% of the pragmatic information. This means there is a lot of room for improvement, and focusing only on literal answers is not enough for good conversational agents.

### *Why PRAGMATICQA is challenging* : 

   - Requires inferring user intent beyond literal questions
   - Requires anticipating likely follow-up questions
   - Must handle diverse domains and conversational contexts
   - Requires sophisticated reasoning about what information is relevant

## **Printing and analyzing 5 examples from the dataset**

In [6]:
pprint.pprint(pcqa_test[0])

{'community': 'The Legend of Zelda',
 'genre': 'Games',
 'qas': [{'a': 'The Legend of Zelda came out as early as 1986 for the Famicom '
               'in Japan, and was later released in the western world, '
               'including Europe and the US in 1987. Would you like to know '
               'about the story?',
          'a_meta': {'literal_obj': [{'endKey': '9cbccabd-66be-4a46-bd8b-f59a299c987d',
                                      'startKey': '1f4f808a-8560-4894-b892-15fa3c33887a',
                                      'text': 'FDS release February 21, '
                                              '1986\n'},
                                     {'endKey': '738bff65-b4f9-4660-bd18-79722ed67a40',
                                      'startKey': 'a0d9d5c5-18bb-4be4-825e-fca2900db18e',
                                      'text': 'The Legend of Zelda is the '
                                              'first installment of the Zelda '
                                   

#### **Q1: "What year did The Legend of Zelda come out?"**

***Literal answer (literal_obj):***

- "FDS release February 21, 1986"

- Short, fact-based.

***Pragmatic answer (pragmatic_obj):***

- "It came out as early as 1986 for the Famicom in Japan, and was later released in the western world, including Europe and the US in 1987."

- Provides broader context: regional release differences.

***Enrichment***:
The literal answer provides only the initial release date in Japan, but the pragmatic answer enriches it by contextualizing the global release (Japan in 1986, US/Europe in 1987). This anticipates what the learner likely cares about — when the game became widely available — and thus goes beyond the minimal, “non-cooperative teacher” style.

In [None]:
pprint.pprint(pcqa_test[25])

{'community': 'LEGO',
 'genre': 'Lifestyle',
 'qas': [{'a': 'In 1916 Christiansen bought the woodworking shop in Billund, '
               'Denmark.',
          'a_meta': {'literal_obj': [{'endKey': '146216fe-12a2-4579-9ec5-135ca606df37',
                                      'startKey': '3a58696a-da2e-47ab-91be-df25b5e9cbcf',
                                      'text': 'The LEGO Group had humble '
                                              'beginnings, starting in the '
                                              'workshop of Ole Kirk '
                                              'Christiansen , a carpenter from '
                                              'Billund , Denmark.'}],
                     'pragmatic_obj': [{'endKey': '3a58696a-da2e-47ab-91be-df25b5e9cbcf',
                                        'startKey': '3a58696a-da2e-47ab-91be-df25b5e9cbcf',
                                        'text': 'In 1916 , Christiansen bought '
                                   

#### **Q1: “How old is LEGO?”**

***Literal answer (literal_obj):***

- "The LEGO Group had humble beginnings, starting in the workshop of Ole Kirk Christiansen, a carpenter from Billund, Denmark."

- This gives origin information but doesn’t directly answer the question "how old?"

***Pragmatic answer (pragmatic_obj):***

- "In 1916, Christiansen bought a woodworking shop in Billund."

- This is much more directly relevant — it anchors the company’s age to a specific founding year (1916).

***Enrichment***:
The pragmatic answer improves on the literal one by actually providing a temporal anchor for calculating LEGO’s age. The literal answer gives only background about the founder’s profession, which doesn’t resolve the "how old?" question.

In [62]:
pprint.pprint(pcqa_test[51])

{'community': 'Kung Fu Panda',
 'genre': 'Movies',
 'qas': [{'a': 'Kunt Fu Panda is an American computer-animated comedy and '
               'action film. It had been produced by Dreamworks Animation in '
               '2008.',
          'a_meta': {'literal_obj': [{'endKey': '04d2394a-bffe-4184-8da0-5c5ebd4e0ec5',
                                      'startKey': '04d2394a-bffe-4184-8da0-5c5ebd4e0ec5',
                                      'text': 'American computer-animated '
                                              'action/comedy film'}],
                     'pragmatic_obj': [{'endKey': '04d2394a-bffe-4184-8da0-5c5ebd4e0ec5',
                                        'startKey': '04d2394a-bffe-4184-8da0-5c5ebd4e0ec5',
                                        'text': 'produced by DreamWorks '
                                                'Animation'},
                                       {'endKey': '04d2394a-bffe-4184-8da0-5c5ebd4e0ec5',
                                      

#### **Q6: "What type of restaurant does Po work at?"**

***Literal answer (literal_obj):***

- "Noodle shop."

- Just a category.

***Pragmatic answer (pragmatic_obj):***

- "Dragon Warrior Noodles & Tofu (Mr. Ping’s Noodle Shop), a popular place in the Valley of Peace."

- Gives name, owner, reputation, and location.

***Enrichment***:
The pragmatic answer provides a richer context by including the restaurant's name, owner, reputation, and location, which are all relevant to understanding the setting of Po's life and the significance of the restaurant in the story.

In [64]:
pprint.pprint(pcqa_test[92])

{'community': 'Peanuts Comics',
 'genre': 'Comics',
 'qas': [{'a': " I don't know exactly how old snoopy is there are some rumours "
               'that he could be over 50 yeas old.  Would you like to know '
               "more about Snoopy's personality?",
          'a_meta': {'literal_obj': [{'endKey': 0,
                                      'startKey': 0,
                                      'text': "I don't know"}],
                     'pragmatic_obj': [{'endKey': 'ed3619e6-f51a-469e-9eb1-1c4ab639ce3b',
                                        'startKey': 'ed3619e6-f51a-469e-9eb1-1c4ab639ce3b',
                                        'text': 'Snoopy is loyal, funny, '
                                                'imaginative and good-natured. '
                                                'He is also a genuinely happy '
                                                'dog. '}]},
          'human_eval': ['2', '5', '4', '1', '3'],
          'q': 'How old is Snoopy?'},
    

#### **Q3: "Of course! Tell me about his relationship with Lucy?"**

***Literal answer (literal_obj):***

- "Snoopy frequently tries to kiss Lucy on the cheek or nose. Lucy is afraid of dog germs and hates these actions, which sometimes results in her injuring Snoopy."

- A direct, surface-level description of their interactions

***Pragmatic answer (pragmatic_obj):***

- "Despite their rivalry towards each other, both seem to care for one another. In one strip, Lucy even admits: ‘You know there are times when you really bug me!’"

- Goes beyond just actions, giving emotional nuance and showing their bond

***Enrichment***:
The pragmatic answer enriches the literal one by highlighting the emotional complexity of Snoopy and Lucy's relationship, showing that despite their conflicts, there is a deeper care between them. This adds a layer of understanding that goes beyond just actions to the feelings involved.

In [69]:
pprint.pprint(pcqa_test[125])


{'community': 'Britney Spears',
 'genre': 'Music',
 'qas': [{'a': 'Yes, Britney Spears is American as she was born in McComb, '
               'Mississippi. ',
          'a_meta': {'literal_obj': [{'endKey': 0,
                                      'startKey': 0,
                                      'text': 'Yes'}],
                     'pragmatic_obj': [{'endKey': '94c6c671-0be2-49d5-b163-e930773742cc',
                                        'startKey': '35cd2023-081c-4a39-a3bd-82a7c4315567',
                                        'text': 'Birthplace\n'
                                                'McComb, Mississippi, U.S.'}]},
          'human_eval': ['1', '1', '1', '1', '1'],
          'q': 'is she american?'},
         {'a': 'It is not known if Britney Spears was born a toe head with '
               'blonde hair, but while she was on Mickey Mouse club is was '
               'brown, so chances are she is not a natural blonde.',
          'a_meta': {'literal_obj': [{'endKey'

#### **Q1: "Is she American?"**

***Literal answer (literal_obj):***

- "Yes."

- Direct answer, minimal information.

***Pragmatic answer (pragmatic_obj):***

- "Birthplace: McComb, Mississippi, U.S. (Britney Spears was born in the United States)."

- Adds birthplace details and gives more context to the answer.

***Enrichment***:
The literal answer only affirms nationality. The pragmatic one not only confirms but elaborates (by referencing her birthplace), which gives supporting evidence and makes the response more informative.
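
The manual comparisons above can be reproduced for any question with a short helper. This is a minimal sketch: `span_texts` and `compare_spans` are hypothetical names, and `sample_item` is copied from the Britney Spears conversation printed above so that the block runs on its own.

```python
def span_texts(qa_item, key):
    """Collect the stripped span texts stored under a_meta[key]."""
    spans = qa_item.get("a_meta", {}).get(key, []) or []
    return [s.get("text", "").strip() for s in spans if isinstance(s, dict)]

def compare_spans(qa_item):
    """Format question, literal spans, pragmatic spans, and the final answer."""
    lines = [f"Q: {qa_item.get('q', '')}"]
    for key in ("literal_obj", "pragmatic_obj"):
        for text in span_texts(qa_item, key):
            lines.append(f"  {key}: {text!r}")
    lines.append(f"A: {qa_item.get('a', '').strip()}")
    return "\n".join(lines)

# Item copied from the Britney Spears conversation shown above.
sample_item = {
    "q": "is she american?",
    "a": "Yes, Britney Spears is American as she was born in McComb, Mississippi. ",
    "a_meta": {
        "literal_obj": [{"text": "Yes", "startKey": 0, "endKey": 0}],
        "pragmatic_obj": [{"text": "Birthplace\nMcComb, Mississippi, U.S.",
                           "startKey": "35cd2023-081c-4a39-a3bd-82a7c4315567",
                           "endKey": "94c6c671-0be2-49d5-b163-e930773742cc"}],
    },
}

print(compare_spans(sample_item))
```

In the notebook, calling `compare_spans(pcqa_test[i]['qas'][j])` would show the same view for any conversation `i` and turn `j`.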

## **Loading the splits of the dataset**

In [6]:
# Import required libraries for traditional NLP approach
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
import torch
import sys
sys.path.append('../')

# Check if we have GPU available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

pcqa_val = read_data("val.jsonl")
print(f"Loaded {len(pcqa_val)} validation documents")

pcqa_test = read_data("test.jsonl")
print(f"Loaded {len(pcqa_test)} test documents")

pcqa_train = read_data("train.jsonl")
print(f"Loaded {len(pcqa_train)} training documents")



Using device: cuda
Loaded 179 validation documents
Loaded 213 test documents
Loaded 476 training documents


In [7]:
pprint.pprint(pcqa_val[0])  # Print the first validation document to check the structure

{'community': 'A Nightmare on Elm Street',
 'genre': 'Movies',
 'qas': [{'a': 'Freddy Kruger is the nightmare in nighmare on Elm street. '
               'Please note, and to be very clear, the system that loads up '
               'wiki is not allowing access to Adam Prag, to the page... so '
               "I'll have to go from memory.  Normally you can paste things "
               "and back up what you are saying, but today that's not "
               'happening. alas.',
          'a_meta': {'literal_obj': [{'endKey': None,
                                      'startKey': None,
                                      'text': 'Cannot GET /wiki/A%20N'}],
                     'pragmatic_obj': [{'endKey': None,
                                        'startKey': None,
                                        'text': 'Cannot GET /wiki/A%20N'}]},
          'human_eval': ['1', '1', '1', '1', '1'],
          'q': 'who is freddy krueger?'},
         {'a': 'Yes and no, it means I can be lighti

In [8]:
# pip install sentence_transformers
import dspy
from sentence_transformers import SentenceTransformer

# Load an extremely efficient local model for retrieval
model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device=device)

embedder = dspy.Embedder(model.encode)
embeddings = embedder(["hello", "world"], batch_size=1)

assert embeddings.shape == (2, 1024)

In [9]:
# Traverse a directory and read html files - extract text from the html files
import os
from bs4 import BeautifulSoup
def read_html_files(directory):
    texts = []
    for filename in os.listdir(directory):
        if filename.endswith(".html"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file, 'html.parser')
                texts.append(soup.get_text())
    return texts


In [10]:
max_characters = 10000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

corpus = read_html_files("../PragmatiCQA-sources/The Legend of Zelda")
print(f"Loaded {len(corpus)} documents. Will encode them below.")

# embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Loaded 406 documents. Will encode them below.
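
Conceptually, an embedding retriever embeds the query, scores every corpus passage by similarity, and returns the top `k`. The sketch below illustrates that idea only; it is not DSPy's implementation. `toy_embed` is a hypothetical bag-of-words stand-in for the sentence-transformer model, and `toy_corpus` is made-up illustration data.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in embedder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, corpus, k=5):
    """Return the k corpus passages most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, toy_embed(doc)), reverse=True)
    return ranked[:k]

# Made-up passages for illustration only.
toy_corpus = [
    "Link must rescue Princess Zelda from Ganon",
    "The Second Quest is a harder replay mode",
    "Rupees are the currency of Hyrule",
]
print(top_k("who must rescue Zelda", toy_corpus, k=1))
```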


In [11]:
search("What is the main quest in The Legend of Zelda?")
# This will return the top 5 documents related to the query about the main quest in The Legend of Zelda.
# You can adjust the query to test different aspects of the corpus.

Prediction(
    passages=['\n\n\n\n\n\n\n\n\n       Second Quest\n      \n\n\n\n\n\n\n        The Second Quest from\n        \n         The Wind Waker HD\n        \n\n\n\n\n\n\n      Game(s)\n     \n\n\n\n        The Legend of Zelda\n       \n\n\n\n        The Adventure of Link\n       \n\n\n\n        The Wind Waker\n       \n\n\n\n\n\n      Other media\n     \n\n\n\n        Zelda\n       \n       (Game & Watch)\n      \n\n\n\n\n      Feature(s)\n     \n\n      Increased difficulty\n      Alternate Dungeons\n     \n\n\n\n\n   The\n   \n    Second Quest\n   \n   ,\n   also known as\n   \n    Second Round\n   \n   ,\n   is a recurring mode in\n   \n\n     The Legend of Zelda\n    \n    series\n   \n   .\n   The Second Quest goes unnamed in\n   \n\n     The Adventure of Link\n    \n\n   .\n  \n\n\n\n     Contents\n    \n\n\n\n\n\n       1\n      \n\n       Overview\n      \n\n\n\n\n\n         1.1\n        \n\n\n          The Legend of Zelda\n         \n\n\n\n\n\n\n         1.2\n        \n
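
The passage above consists largely of newline runs left behind by `get_text()`. One optional preprocessing step, which this notebook does not apply and whose effect on retrieval quality is untested here, is to collapse whitespace before embedding. A minimal sketch, with `normalize_whitespace` as a hypothetical helper name:

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# Fragment taken from the start of the retrieved passage shown above.
raw = ("\n\n\n       Second Quest\n      \n\n\n        The Second Quest from\n"
       "        \n         The Wind Waker HD\n")
print(normalize_whitespace(raw))
```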

### **Loading and mapping each community to its corresponding corpus**

In [12]:

def make_mapping(dataset, sources_dir="../PragmatiCQA-sources"):
    """Map each community name to the list of page texts in its source directory."""
    available_dirs = {d for d in os.listdir(sources_dir)
                      if os.path.isdir(os.path.join(sources_dir, d))}

    corpuses = {}
    for conv in dataset:
        community = conv["community"]
        if community in corpuses:
            continue
        if community not in available_dirs:
            # Skip communities without a source directory instead of raising
            print(f"✗ No source directory for community: '{community}'")
            continue
        corpuses[community] = read_html_files(os.path.join(sources_dir, community))
        print(f"✓ Loaded {len(corpuses[community])} docs for community: '{community}'")

    return corpuses

In [13]:
datasets_for_mapping = pcqa_test + pcqa_val + pcqa_train
corpuses_mapping = make_mapping(dataset=datasets_for_mapping)

✓ Loaded 406 docs for community: 'The Legend of Zelda'
✓ Loaded 476 docs for community: 'LEGO'
✓ Loaded 465 docs for community: 'Kung Fu Panda'
✓ Loaded 392 docs for community: 'Peanuts Comics'
✓ Loaded 495 docs for community: 'Mystery Science Theater 3000'
✓ Loaded 310 docs for community: 'Britney Spears'
✓ Loaded 165 docs for community: 'Throne of Glass'
✓ Loaded 499 docs for community: 'Studio Ghibli'
✓ Loaded 484 docs for community: 'Fallout'
✓ Loaded 498 docs for community: 'Baseball'
✓ Loaded 250 docs for community: 'A Nightmare on Elm Street'
✓ Loaded 496 docs for community: 'Batman'
✓ Loaded 46 docs for community: 'Supernanny'
✓ Loaded 257 docs for community: 'Hamilton the Musical'
✓ Loaded 499 docs for community: 'Wizard of Oz'
✓ Loaded 367 docs for community: 'Jujutsu Kaisen'
✓ Loaded 195 docs for community: 'Enter the Gungeon'
✓ Loaded 499 docs for community: 'Dinosaur'
✓ Loaded 250 docs for community: 'The Karate Kid'
✓ Loaded 495 docs for community: 'Pop

In [14]:
print(len(corpuses_mapping))

54


In [15]:
print(len(pcqa_val))

179


In [16]:
pprint.pprint(pcqa_val[10])  # Print the validation document at index 10 to check the structure

{'community': 'Batman',
 'genre': 'Comics',
 'qas': [{'a': 'Yes, he has been the protector of Gotham City for a long time.',
          'a_meta': {'literal_obj': [{'endKey': 0,
                                      'startKey': 0,
                                      'text': 'Yes'}],
                     'pragmatic_obj': [{'endKey': '7fa976b1-4e71-4d48-a787-1bc50171a2b1',
                                        'startKey': '40abceed-4e56-4394-90e0-5e5d22a8268e',
                                        'text': 'Batman has been Gotham City '
                                                "'s protector for decades"}]},
          'human_eval': ['2', '3', '2', '2', '2'],
          'q': 'Ok, Is batman a superhero?'},
         {'a': 'I am not sure of the exact moment, but he did see his parents '
               'get murded as a child, and that eventually led him to be a '
               'crime fighter. His secret identity is Bruce Wayne.',
          'a_meta': {'literal_obj': [{'endKey': 0,
  

In [16]:
print(pcqa_val[0])

{'topic': 'A Nightmare on Elm Street (2010 film)', 'genre': 'Movies', 'community': 'A Nightmare on Elm Street', 'qas': [{'q': 'who is freddy krueger?', 'a_meta': {'literal_obj': [{'text': 'Cannot GET /wiki/A%20N', 'startKey': None, 'endKey': None}], 'pragmatic_obj': [{'text': 'Cannot GET /wiki/A%20N', 'startKey': None, 'endKey': None}]}, 'a': "Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.", 'human_eval': ['1', '1', '1', '1', '1']}, {'q': 'oh man, that sucks.', 'a_meta': {'literal_obj': [{'text': 'Cannot GET /wiki/A%20N', 'startKey': None, 'endKey': None}], 'pragmatic_obj': [{'text': 'Cannot GET /wiki/A%20N', 'startKey': None, 'endKey': None}]}, 'a': "Yes and no, it means I can be lighting quick, especially since I type quickly,

### **TraditionalQA model**

In [16]:
# Create DSPy module for DistilBERT QA model (Traditional NLP approach)
from transformers import pipeline
import dspy

class distilbert(dspy.Module):
    """DSPy module wrapping Hugging Face DistilBERT for extractive QA"""
    
    def __init__(self, model_name="distilbert-base-cased-distilled-squad"):
        super().__init__()
        print(f"Loading {model_name}...")
        self.qa_pipeline = pipeline(
            "question-answering",
            model=model_name,
            tokenizer=model_name,
            device=device
        )
        print("✅ QA model loaded successfully!")
    
    def forward(self, context, question):
        """Extract answer from context using DistilBERT"""
        try:
            result = self.qa_pipeline(question=question, context=context)
            return dspy.Prediction(response=result['answer'], confidence=result['score'])
        except Exception as e:
            print(f"Error in QA: {e}")
            return dspy.Prediction(response="", confidence=0.0)

# Initialize the traditional QA model
traditional_qa = distilbert()

Loading distilbert-base-cased-distilled-squad...


Device set to use cuda


✅ QA model loaded successfully!


### **Creating a retriever for each community (corpus)**

In [17]:
# Create retrieval setup for each community/topic
def create_retrievers(corpuses_mapping, embedder, k=5):
    """Create retrievers for each community in the corpus mapping"""
    retrievers = {}
    for community, corpus in corpuses_mapping.items():
        if len(corpus) > 0:  # Only create retriever if corpus has documents
            retrievers[community] = dspy.retrievers.Embeddings(
                embedder=embedder, 
                corpus=corpus, 
                k=k
            )
    return retrievers

In [33]:
# Create retrievers for all communities
print("Creating retrievers for all communities...")
retrievers = create_retrievers(corpuses_mapping, embedder, k=5)
print(f"Created {len(retrievers)} retrievers")

Creating retrievers for all communities...
Created 54 retrievers


### **Methods for getting different types of contexts**

In [19]:
# Context creation functions for the three configurations

def get_literal_context(qa_item):
    """Get context from literal spans"""
    literal_spans = qa_item.get('a_meta', {}).get('literal_obj', [])
    if isinstance(literal_spans, list):
        return ' '.join([span.get('text', '') for span in literal_spans if isinstance(span, dict)])
    return ""

def get_pragmatic_context(qa_item):
    """Get context from pragmatic spans"""
    pragmatic_spans = qa_item.get('a_meta', {}).get('pragmatic_obj', [])
    if isinstance(pragmatic_spans, list):
        return ' '.join([span.get('text', '') for span in pragmatic_spans if isinstance(span, dict)])
    return ""

def get_retrieved_context(question, community, retrievers ):
    """Get context from retriever"""
    if community in retrievers:
        try:
            retrieved = retrievers[community](question)
            if hasattr(retrieved, 'passages'):
                return ' '.join(retrieved.passages)
            elif isinstance(retrieved, list):
                return ' '.join([str(passage) for passage in retrieved])
        except Exception as e:
            print(f"Retrieval error for community {community}: {e}")
    return ""

def get_retrieved_context_maxchars(question, community, retrievers, max_characters_per_passage=10000):
    """Get context from retriever, truncating each passage to at most max_characters_per_passage characters."""
    if community in retrievers:
        try:
            retrieved = retrievers[community](question)
            if hasattr(retrieved, 'passages'):
                passages = retrieved.passages
            elif isinstance(retrieved, list):
                passages = retrieved
            else:
                passages = []
            # Slicing already caps at the string length, so no min() is needed
            return ' '.join(str(p)[:max_characters_per_passage] for p in passages)
        except Exception as e:
            print(f"Retrieval error for community {community}: {e}")
    return ""  # Match get_retrieved_context: empty string when nothing is retrieved

def answer_question(qa_model, context, question):
    """Answer question using the QA model"""
    if not context or not context.strip():
        return {"answer": "", "confidence": 0.0}
    
    try:
        result = qa_model(context=context, question=question)
        return {"answer": result.response, "confidence": result.confidence}
    except Exception as e:
        print(f"QA error: {e}")
        return {"answer": "", "confidence": 0.0}

In [None]:
# Set up SemanticF1 evaluation with an LLM judge
try:
    from dspy.evaluate import SemanticF1
    
    # Configure the judge LLM for SemanticF1 (Grok here; a local Ollama alternative is commented out)
    # lm = dspy.LM('ollama/qwen3:4b-instruct', api_base='http://localhost:11434')
    lm = dspy.LM('xai/grok-3-mini', api_key="api_key_here")
    dspy.configure(lm=lm)
    semantic_f1 = SemanticF1()
    print("✅ Using grok for SemanticF1")
    
except Exception as e:
    print(f"⚠️ Error setting up the judge LLM: {e}")
    print("Make sure the API key is valid (or, for Ollama, that the server is running and the model is available)")
    semantic_f1 = None

✅ Using grok for SemanticF1


### **Getting the results for different types of contexts**
- ***NOTE: `dspy.Evaluate` is not used here because the DistilBERT model runs locally and inference is fast on CUDA, so a plain loop is sufficient.***
- ***For the LLM-based RAG, `dspy.Evaluate` is used because it is more convenient.***

In [146]:
from tqdm import tqdm

print("Starting evaluation on first questions from validation set...")



# Process each first question with tqdm progress bar
results = { 
    "gt":{
        "literal_context": [],
        "pragmatic_context": [],
        "retrieved_context": []
    },
    "pred":{
        "literal_context": [],
        "pragmatic_context": [],
        "retrieved_context": []
    }
}
for i, conv in tqdm(enumerate(pcqa_val), total=len(pcqa_val), desc="Evaluating questions"):
    qa_item = conv['qas'][0] 
    question = qa_item['q']
    ground_truth = qa_item['a']
    
    # Find the community for this question (from the parent document)
    community = conv["community"]
    
    if not community:
        print(f"Warning: Could not find community for question {i}")
        continue
    
    # Configuration 1: Literal context
    literal_context = get_literal_context(qa_item)
    if literal_context:
        literal_result = answer_question(traditional_qa, literal_context, question)
        results["gt"]["literal_context"].append(dspy.Example(question=question,response=ground_truth,inputs={'context': literal_context}))
        results["pred"]["literal_context"].append(dspy.Example(question=question,response=literal_result["answer"],inputs={'context': literal_context}))
        # results["pred"]["literal_context"].append(literal_result)
    # Configuration 2: Pragmatic context  
    pragmatic_context = get_pragmatic_context(qa_item)
    if pragmatic_context:
        pragmatic_result = answer_question(traditional_qa, pragmatic_context, question)
        results["gt"]["pragmatic_context"].append(dspy.Example(question=question,response=ground_truth,inputs={'context': pragmatic_context}))
        results["pred"]["pragmatic_context"].append(dspy.Example(question=question,response=pragmatic_result["answer"],inputs={'context': pragmatic_context}))
        # results["pred"]["pragmatic_context"].append(pragmatic_result)
    # Configuration 3: Retrieved context
    retrieved_context = get_retrieved_context(question, community, retrievers)
    if retrieved_context:
        retrieved_result = answer_question(traditional_qa, retrieved_context, question)
        results["gt"]["retrieved_context"].append(dspy.Example(question=question,response=ground_truth,inputs={'context': retrieved_context}))
        results["pred"]["retrieved_context"].append(dspy.Example(question=question,response=retrieved_result["answer"],inputs={'context': retrieved_context}))
        # results["pred"]["retrieved_context"].append(retrieved_result)
print("✅ Evaluation completed!")

Starting evaluation on first questions from validation set...


Evaluating questions:   2%|▏         | 3/179 [00:01<00:58,  3.01it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Evaluating questions: 100%|██████████| 179/179 [01:28<00:00,  2.03it/s]

✅ Evaluation completed!





### **Function for evaluating the results**

In [21]:
def evaluate_results(ground_truth, predictions):
    """
    Evaluate results using SemanticF1 for individual evaluation
    Args:
        ground_truth: List of DSPy Examples with ground truth responses
        predictions: List of DSPy Examples with predicted responses
    Returns:
        Dictionary with precision, recall, f1 scores
    """
    if not ground_truth or not predictions:
        print("No data to evaluate")
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "count": 0}
    
    if len(ground_truth) != len(predictions):
        print(f"Warning: Mismatch in lengths - GT: {len(ground_truth)}, Pred: {len(predictions)}")
        raise ValueError("Ground truth and predictions must have the same length")

    # Evaluate each example individually with progress bar
    total_precision = 0.0
    total_recall = 0.0
    total_f1 = 0.0
    
    for gt_example, pred_example in tqdm(zip(ground_truth, predictions), 
                                         total=len(ground_truth), 
                                         desc="Evaluating examples"):
        try:
            score = semantic_f1(gt_example, pred_example)
            
            # Extract precision, recall, F1 from the score object when it exposes them
            if hasattr(score, 'precision') and hasattr(score, 'recall') and hasattr(score, 'f1'):
                precision = score.precision
                recall = score.recall
                f1 = score.f1
            else:
                # A plain number is treated as the F1 score. Precision and recall are
                # then not available separately, so F1 is copied into both as a
                # placeholder; the three reported columns will be identical.
                f1 = float(score)
                precision = f1
                recall = f1
            
            total_precision += precision
            total_recall += recall
            total_f1 += f1
            
        except Exception as e:
            print(f"Error evaluating example: {e}")
            
    # Return average scores
    count = len(ground_truth)
    return {
        "precision": total_precision / count if count > 0 else 0.0,
        "recall": total_recall / count if count > 0 else 0.0,
        "f1": total_f1 / count if count > 0 else 0.0,
        "count": count
    }
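
One caveat about `evaluate_results` above: when the metric raises for an example, that example adds nothing to the totals but is still counted in the denominator, which pulls the averages toward zero. A minimal sketch of an alternative that averages only over successfully scored pairs follows. `average_scored` and `toy_metric` are hypothetical names used for illustration; the toy metric stands in for `semantic_f1` so the snippet runs without a language model.

```python
from typing import Callable, Dict, Sequence


def average_scored(ground_truth: Sequence, predictions: Sequence,
                   metric: Callable[[object, object], float]) -> Dict[str, float]:
    """Average a per-example metric, skipping pairs where the metric raises."""
    if len(ground_truth) != len(predictions):
        raise ValueError("Ground truth and predictions must have the same length")

    total, scored, failed = 0.0, 0, 0
    for gt, pred in zip(ground_truth, predictions):
        try:
            total += float(metric(gt, pred))
            scored += 1
        except Exception:
            failed += 1  # Reported separately instead of counted as a zero

    return {
        "f1": total / scored if scored else 0.0,
        "scored": scored,
        "failed": failed,
    }


def toy_metric(gt: str, pred: str) -> float:
    """Hypothetical stand-in for semantic_f1: exact match, raises on empty input."""
    if not pred:
        raise ValueError("empty prediction")
    return 1.0 if gt == pred else 0.0


summary = average_scored(["a", "b", "c"], ["a", "x", ""], toy_metric)
print(summary)
```

Reporting `scored` and `failed` alongside the average makes it visible when a low score comes from evaluation errors rather than from poor answers.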


In [148]:
configurations = ["literal_context", "pragmatic_context", "retrieved_context"]
print("=" * 60)
print("SEMANTIC F1 EVALUATION RESULTS")
print("=" * 60)


print("\n" + "=" * 60)
print("SUMMARY TABLE")
print("=" * 60)
print(f"{'Configuration':<18} | {'Precision':<9} | {'Recall':<9} | {'F1':<9} | {'Count':<5}")
print("-" * 60)

# Evaluate each configuration and print one summary row
for config in configurations:
    gt_examples = results["gt"][config]
    pred_examples = results["pred"][config]
    # if(config=="retrieved_context"):
    #     print(gt_examples[0]["inputs"]["context"])
    if gt_examples and pred_examples:
        scores = evaluate_results(gt_examples, pred_examples)
        config_name = config.replace('_', ' ').title()
        print(f"{config_name:<18} | {scores['precision']:<9.4f} | {scores['recall']:<9.4f} | {scores['f1']:<9.4f} | {scores['count']:<5d}")

print("=" * 60)

SEMANTIC F1 EVALUATION RESULTS

SUMMARY TABLE
Configuration      | Precision | Recall    | F1        | Count
------------------------------------------------------------


Evaluating examples: 100%|██████████| 179/179 [30:19<00:00, 10.16s/it]


Literal Context    | 0.4282    | 0.4282    | 0.4282    | 179  


Evaluating examples: 100%|██████████| 179/179 [25:16<00:00,  8.47s/it]


Pragmatic Context  | 0.3893    | 0.3893    | 0.3893    | 179  


Evaluating examples: 100%|██████████| 179/179 [23:33<00:00,  7.90s/it]

Retrieved Context  | 0.0905    | 0.0905    | 0.0905    | 179  





### **Saving the DistilBERT model results to a JSON file**

In [None]:
# Save results to JSON file
import json
from datetime import datetime

# Prepare results for JSON serialization
json_results = {
    "metadata": {
        "timestamp": datetime.now().isoformat(),
        "model": "distilbert-base-cased-distilled-squad",
        "evaluator": "SemanticF1 with Ollama qwen3:4b-instruct",
        "dataset": "PragmatiCQA validation set (first questions only)"
    },
    "configurations": {}
}

# Add results for each configuration
for config in configurations:
    gt_examples = results["gt"][config]
    pred_examples = results["pred"][config]
    
    if gt_examples and pred_examples:
        # Get evaluation scores
        scores = evaluate_results(gt_examples, pred_examples)
        
        # Convert examples to serializable format
        gt_serializable = []
        pred_serializable = []
        
        for gt_ex, pred_ex in zip(gt_examples, pred_examples):
            gt_serializable.append({
                "question": gt_ex.question,
                "response": gt_ex.response,
                "context": gt_ex["inputs"]["context"] if hasattr(gt_ex, 'inputs') else ''
            })
            pred_serializable.append({
                "question": pred_ex.question,
                "response": pred_ex.response,
                "context": pred_ex["inputs"]["context"] if hasattr(pred_ex, 'inputs') else ''
            })
        
        json_results["configurations"][config] = {
            "scores": {
                "precision": scores['precision'],
                "recall": scores['recall'],
                "f1": scores['f1'],
                "count": scores['count']
            },
            "ground_truth_examples": gt_serializable,
            "predicted_examples": pred_serializable
        }

# Save to file
output_filename = "evaluation_results_distillbert.json"
with open(output_filename, 'w', encoding='utf-8') as f:
    json.dump(json_results, f, indent=2, ensure_ascii=False)

print(f"✅ Results saved to: {output_filename}")
print(f"📊 Total configurations saved: {len(json_results['configurations'])}")

# Display file size
import os
file_size = os.path.getsize(output_filename)
print(f"📁 File size: {file_size:,} bytes ({file_size/1024:.1f} KB)")

Evaluating examples: 100%|██████████| 179/179 [00:00<00:00, 2209.16it/s]
Evaluating examples: 100%|██████████| 179/179 [00:00<00:00, 2182.46it/s]
Evaluating examples: 100%|██████████| 179/179 [00:00<00:00, 2154.99it/s]



✅ Results saved to: evaluation_results_20250820_173534.json
📊 Total configurations saved: 3
📁 File size: 40,621,896 bytes (39669.8 KB)


### **Creating DSPy examples to try the dspy.Evaluate method**

In [None]:
from dspy.evaluate import Evaluate
from tqdm import tqdm

print("🔄 Starting DSPy evaluate method evaluation...")


qa_program = traditional_qa

# Prepare evaluation datasets for each configuration
evaluation_datasets = {}

for config in ["literal_context", "pragmatic_context", "retrieved_context"]:
    print(f"\n📊 Preparing evaluation dataset for {config}...")
    
    eval_examples = []
    
    for i, conv in tqdm(enumerate(pcqa_val), total=len(pcqa_val), desc=f"Processing {config}"):
        qa_item = conv['qas'][0]
        question = qa_item['q']
        ground_truth = qa_item['a']
        community = conv["community"]
        
        if not community:
            continue
        
        # Get context based on configuration
        if config == "literal_context":
            context = get_literal_context(qa_item)
        elif config == "pragmatic_context":
            context = get_pragmatic_context(qa_item)
        else:  # retrieved_context
            context = get_retrieved_context(question, community, retrievers)
        
        if context:
            # Create DSPy Example with inputs and expected output
            # The distilbert model expects 'context' and 'question' parameters
            example = dspy.Example(
                context=context,
                question=question,
                response=ground_truth  # Expected answer, stored under the 'response' field
            ).with_inputs('context', 'question')
            
            eval_examples.append(example)
    
    evaluation_datasets[config] = eval_examples
    print(f"✅ Created {len(eval_examples)} examples for {config}")

print(f"\n🎯 Ready to evaluate with DSPy's evaluate method!")

🔄 Starting DSPy evaluate method evaluation...

📊 Preparing evaluation dataset for literal_context...


Processing literal_context: 100%|██████████| 179/179 [00:00<00:00, 119341.98it/s]


✅ Created 179 examples for literal_context

📊 Preparing evaluation dataset for pragmatic_context...


Processing pragmatic_context: 100%|██████████| 179/179 [00:00<00:00, 357684.81it/s]


✅ Created 179 examples for pragmatic_context

📊 Preparing evaluation dataset for retrieved_context...


Processing retrieved_context: 100%|██████████| 179/179 [00:20<00:00,  8.65it/s]

✅ Created 179 examples for retrieved_context

🎯 Ready to evaluate with DSPy's evaluate method!





### **Experiment: evaluating the results using the dspy.Evaluate method**

In [150]:
# Run DSPy evaluate with SemanticF1 metric
print("=" * 80)
print("DSPy EVALUATE METHOD RESULTS WITH SEMANTICF1")
print("=" * 80)

# Set up the evaluator
evaluator = Evaluate(
    devset=None,  # We'll pass datasets directly
    metric=semantic_f1,
    num_threads=10,  # Use 10 threads for parallel evaluation
    display_progress=True,
    display_table=5,  # Show the first 5 examples
)

# Store results
dspy_eval_results = {}

for config, dataset in evaluation_datasets.items():
    if not dataset:
        print(f"❌ No data for configuration: {config}")
        continue
        
    print(f"\n🔍 Evaluating {config} using DSPy evaluate method...")
    print(f"📊 Dataset size: {len(dataset)} examples")
    
    try:
        # Run the evaluation - DSPy's evaluator already handles averaging
        score, outputs = evaluator(
            program=qa_program,
            devset=dataset,
            return_outputs=True
        )
        
        dspy_eval_results[config] = {
            'score': score,
            'count': len(dataset),
            'outputs': outputs  # Store the individual outputs
        }
        
        print(f"✅ {config.replace('_', ' ').title()}")
        print(f"   SemanticF1 Score: {score:.4f}")
        print(f"   Examples:         {len(dataset)}")
        print(f"   Outputs returned: {len(outputs) if outputs else 0}")
        
    except Exception as e:
        print(f"❌ Error evaluating {config}: {e}")
        import traceback
        traceback.print_exc()

print("\n" + "=" * 80)
print("DSPy EVALUATE SUMMARY")
print("=" * 80)
print(f"{'Configuration':<18} | {'SemanticF1':<12} | {'Count':<5}")
print("-" * 40)

for config, result in dspy_eval_results.items():
    config_name = config.replace('_', ' ').title()
    # Convert score from percentage (0-100) to decimal (0-1) format
    score_decimal = result['score'] / 100 if result['score'] > 1 else result['score']
    print(f"{config_name:<18} | {score_decimal:<12.4f} | {result['count']:<5d} ")

print("=" * 80)

DSPy EVALUATE METHOD RESULTS WITH SEMANTICF1

🔍 Evaluating literal_context using DSPy evaluate method...
📊 Dataset size: 179 examples
  0%|          | 0/179 [00:00<?, ?it/s]

Average Metric: 76.64 / 179 (42.8%): 100%|██████████| 179/179 [00:00<00:00, 220.14it/s]

2025/08/21 16:48:39 INFO dspy.evaluate.evaluate: Average Metric: 76.63955449508343 / 179 (42.8%)





Unnamed: 0,context,question,example_response,pred_response,confidence,SemanticF1
0,Cannot GET /wiki/A%20N,who is freddy krueger?,Freddy Kruger is the nightmare in nighmare on Elm street. Please n...,Cannot GET /wiki/A%20N,0.22141,✔️ [0.500]
1,Cannot GET /wiki/A%20Nightmare%20on%20Elm%20Street/A%20Nightmare%2...,who was the star on this movie?,"Robert Englund IS Freddy Kruger, the bad guy for these films. Note...",20Nightmare,0.293904,
2,Cannot GET /wiki/A%20Nightmare%20on%20Elm%20Street/A%20Nightmare%2...,What is the movie about?,"Ok, here goes, I'm getting ""Cannot get""..so, Nightmare on Elm stre...",20film,0.059085,
3,Cannot GET /wiki/A%20Nightmare%20on%20Elm%20Street/A%20Nightmare%2...,Who directed the new film?,It was Directed by: Samuel Bayer. Note that the link here is broke...,2010%20film,0.392794,
4,"Bruce Wayne is born to Dr. Thomas Wayne and his wife Martha Kane ,...",Is the Batman comic similar to the movies?,"I would say the movie and comics has same story line, as Batmans p...",Gotham City socialites,0.193257,✔️ [0.400]


✅ Literal Context
   SemanticF1 Score: 42.8200
   Examples:         179
   Outputs returned: 179

🔍 Evaluating pragmatic_context using DSPy evaluate method...
📊 Dataset size: 179 examples
Average Metric: 69.69 / 179 (38.9%): 100%|██████████| 179/179 [00:00<00:00, 199.87it/s]

2025/08/21 16:48:40 INFO dspy.evaluate.evaluate: Average Metric: 69.68638737247102 / 179 (38.9%)





Unnamed: 0,context,question,example_response,pred_response,confidence,SemanticF1
0,Cannot GET /wiki/A%20N,who is freddy krueger?,Freddy Kruger is the nightmare in nighmare on Elm street. Please n...,Cannot GET /wiki/A%20N,0.22141,✔️ [0.500]
1,Cannot GET /wiki/A%20Nightmare%20on%20Elm%20Street/A%20Nightmare%2...,who was the star on this movie?,"Robert Englund IS Freddy Kruger, the bad guy for these films. Note...",20Nightmare,0.293904,
2,Cannot GET /wiki/A%20Nightmare%20on%20Elm%20Street/A%20Nightmare%2...,What is the movie about?,"Ok, here goes, I'm getting ""Cannot get""..so, Nightmare on Elm stre...",20film,0.405578,
3,Cannot GET /wiki/A%20Nightmare%20on%20Elm%20Street/A%20Nightmare%2...,Who directed the new film?,It was Directed by: Samuel Bayer. Note that the link here is broke...,2010%20film,0.392794,
4,"While returning home one night, his parents were killed by a small...",Is the Batman comic similar to the movies?,"I would say the movie and comics has same story line, as Batmans p...",his parents were killed by a small-time criminal named Joe Chill,0.480525,✔️ [0.400]


✅ Pragmatic Context
   SemanticF1 Score: 38.9300
   Examples:         179
   Outputs returned: 179

🔍 Evaluating retrieved_context using DSPy evaluate method...
📊 Dataset size: 179 examples
Average Metric: 16.20 / 179 (9.0%): 100%|██████████| 179/179 [00:56<00:00,  3.19it/s]

2025/08/21 16:49:37 INFO dspy.evaluate.evaluate: Average Metric: 16.199108518875796 / 179 (9.0%)





Unnamed: 0,context,question,example_response,pred_response,confidence,SemanticF1
0,Freddy Krueger General information Age ? (at the time of physical ...,who is freddy krueger?,Freddy Kruger is the nightmare in nighmare on Elm street. Please n...,Trivia\n \n\n\n\n Billy Bob Thornton,0.689301,
1,Johnny Depp Information Date of birth TBA Film(s) A Nightmare on E...,who was the star on this movie?,"Robert Englund IS Freddy Kruger, the bad guy for these films. Note...",Captain Jack Sparrow,0.954749,
2,"Samuel Bayer Information Date of birth February 17, 1965 (age 47) ...",What is the movie about?,"Ok, here goes, I'm getting ""Cannot get""..so, Nightmare on Elm stre...",A Nightmare on Elm Street,2.025873,✔️ [0.400]
3,"Samuel Bayer Information Date of birth February 17, 1965 (age 47) ...",Who directed the new film?,It was Directed by: Samuel Bayer. Note that the link here is broke...,Gore Verbinski,1.934716,
4,Warner Bros. is the film studio that owns DC Comics and the Batman...,Is the Batman comic similar to the movies?,"I would say the movie and comics has same story line, as Batmans p...",The most consistent,0.298756,


✅ Retrieved Context
   SemanticF1 Score: 9.0500
   Examples:         179
   Outputs returned: 179

DSPy EVALUATE SUMMARY
Configuration      | SemanticF1   | Count
----------------------------------------
Literal Context    | 0.4282       | 179   
Pragmatic Context  | 0.3893       | 179   
Retrieved Context  | 0.0905       | 179   
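
The summary loop above rescales the evaluator's score with the rule `score / 100 if score > 1 else score`. That works for the values reported here, but it is a heuristic: a percentage at or below 1.0 (for example 0.8%) would be left unscaled. Assuming the evaluator always reports a percentage in the 0-100 range, as the log lines above suggest for this run, the conversion can be made unconditional. `percent_to_fraction` is a hypothetical helper, not part of DSPy.

```python
def percent_to_fraction(score_percent: float) -> float:
    """Convert a 0-100 percentage to a 0-1 fraction, validating the range."""
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError(f"Expected a percentage in [0, 100], got {score_percent}")
    return score_percent / 100.0


# Scores printed by the evaluation run above, in percent
reported = {
    "literal_context": 42.82,
    "pragmatic_context": 38.93,
    "retrieved_context": 9.05,
}

for config, pct in reported.items():
    print(f"{config:<18} | {percent_to_fraction(pct):.4f}")
```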


In [None]:
# Load the results from the JSON file (evaluation_results_distillbert.json)
with open("evaluation_results_distillbert.json", 'r', encoding='utf-8') as f:
    loaded_results = json.load(f)

# Display the results
for config, data in loaded_results.get("configurations", {}).items():
    scores = data.get("scores", {})
    print(f"\nConfiguration: {config.replace('_', ' ').title()}")
    print(f"  Precision: {scores.get('precision', 0.0):.4f}")
    print(f"  Recall:    {scores.get('recall', 0.0):.4f}")
    print(f"  F1:        {scores.get('f1', 0.0):.4f}")
    print(f"  Count:     {scores.get('count', 0)}")

# Print the outputs for each configuration
gt_examples = loaded_results["configurations"]["retrieved_context"]["ground_truth_examples"]
literal_pred_examples = loaded_results["configurations"]["literal_context"]["predicted_examples"]
pragmatic_pred_examples = loaded_results["configurations"]["pragmatic_context"]["predicted_examples"]
retrieved_pred_examples = loaded_results["configurations"]["retrieved_context"]["predicted_examples"]
print(f"\nDisplaying first {len(gt_examples)} examples from the loaded results:\n")
print("=" * 40)
for i in range(len(gt_examples)):
    print(f"Question {i+1}:")
    print(f"  Question: {gt_examples[i]['question']}")
    print(f"  Ground Truth Answer: {gt_examples[i]['response']}")
    print(f"  Literal Predicted Answer: {literal_pred_examples[i]['response']}")
    print(f"  Pragmatic Predicted Answer: {pragmatic_pred_examples[i]['response']}")
    print(f"  Retrieved Predicted Answer: {retrieved_pred_examples[i]['response']}")
    print("=" * 40)


Configuration: Literal Context
  Precision: 0.4287
  Recall:    0.4287
  F1:        0.4287
  Count:     179

Configuration: Pragmatic Context
  Precision: 0.3092
  Recall:    0.3092
  F1:        0.3092
  Count:     179

Configuration: Retrieved Context
  Precision: 0.0846
  Recall:    0.0846
  F1:        0.0846
  Count:     179

Displaying first 179 examples from the loaded results:

Question 1:
  Question: who is freddy krueger?
  Ground Truth Answer: Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.
  Literal Predicted Answer: Cannot GET /wiki/A%20N
  Pragmatic Predicted Answer: Cannot GET /wiki/A%20N
  Retrieved Predicted Answer: Trivia
   



    Billy Bob Thornton
Question 2:
  Question: who was the star on this movie?
  Grou

## Analysis of Results: Model Performance 

#### **1. Literal Context Performance (Best: ~0.43 F1)**
- ***The model performs best when given the literal spans from the ground-truth annotations.***

#### **2. Pragmatic Context Performance (Moderate: ~0.39 F1)**
- ***The model can extract some relevant information from the pragmatic spans.***
- ***The lower score also indicates that the dataset includes challenging pragmatic information.***

===================================================================================
#### **3. Retrieved Context Performance (Poor: ~0.09 F1)**
- **Major Issues**: ***The sharp performance drop points to serious problems in the retrieval-QA pipeline.***
- **Possible Causes**:
  - ***Retrieved documents may not contain the exact answer spans needed for extractive QA.***
  - ***The model may be unable to extract effectively from the retrieved context.***

#### **Retrieval-QA Pipeline Failure:**
- ***F1 falls from 0.4282 with literal context to 0.0905 with retrieved context, a relative drop of about 79%, which indicates a critical failure of the pipeline.***

#### **Analysis of the retrieval-QA pipeline**
 - ***The model failed to provide a correct answer almost 99% of the time.***
 - ***All of the answers were purely literal, with no pragmatic information at all, even when a more pragmatic answer was needed.***

===================================================================================
#### **Literal vs Pragmatic Bias:**
- ***The model shows a strong bias toward literal extraction over pragmatic reasoning.***

#### **PragmatiCQA Challenge Validation:**
- ***Results are consistent with the dataset's challenge: traditional QA models miss >90% of pragmatic information.***
- ***The gap between literal (42.8%) and pragmatic (38.9% in the evaluation above; 30.9% in the saved JSON results) performance supports the need for conversational QA approaches.***
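
The relative gaps discussed in this analysis can be checked directly from the summary table. The sketch below uses the F1 values printed by the `dspy.Evaluate` run (0.4282, 0.3893, 0.0905); `relative_drop` is a hypothetical helper introduced only for this check.

```python
def relative_drop(baseline: float, value: float) -> float:
    """Fraction of the baseline score lost when moving to `value`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


# F1 per configuration, taken from the DSPy evaluate summary table
f1 = {"literal": 0.4282, "pragmatic": 0.3893, "retrieved": 0.0905}

for name in ("pragmatic", "retrieved"):
    drop = relative_drop(f1["literal"], f1[name])
    print(f"literal -> {name}: {drop:.1%} relative drop")
```

With these inputs the drop from literal to pragmatic context is about 9%, and the drop from literal to retrieved context is about 79%.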


# **4.4 Part 2**


## **4.4.1 First Questions Evaluation**



In [None]:
# Configure LLM for DSPy
import dspy

# Set up Ollama LLM for our multi-step prompting approach
# llm = dspy.LM('ollama/qwen3:4b-instruct', api_base='http://localhost:11434', api_key='')
llm = dspy.LM('xai/grok-3-mini', api_key="api_key_here")

# Configure DSPy to use our LLM
dspy.configure(lm=llm)

print("✅ LLM configured for DSPy")
print(f"Model: {llm.model}")

test_response = llm("What is the capital of France?")
print(f"\n🧪 Test response: {test_response[0]}")

✅ LLM configured for DSPy
Model: xai/grok-3-mini

🧪 Test response: The capital of France is Paris. It's not only the political and cultural heart of the country but also one of the most iconic cities in the world! If you have any more questions about France or anything else, feel free to ask. 😊


In [35]:
datasets_for_mapping = pcqa_test + pcqa_val + pcqa_train
corpuses_mapping = make_mapping(dataset=datasets_for_mapping)

✓ Loaded 406 docs for community: 'The Legend of Zelda''
✓ Loaded 476 docs for community: 'LEGO''
✓ Loaded 465 docs for community: 'Kung Fu Panda''
✓ Loaded 392 docs for community: 'Peanuts Comics''
✓ Loaded 495 docs for community: 'Mystery Science Theater 3000''
✓ Loaded 310 docs for community: 'Britney Spears''
✓ Loaded 165 docs for community: 'Throne of Glass''
✓ Loaded 499 docs for community: 'Studio Ghibli''
✓ Loaded 484 docs for community: 'Fallout''
✓ Loaded 498 docs for community: 'Baseball''
✓ Loaded 250 docs for community: 'A Nightmare on Elm Street''
✓ Loaded 496 docs for community: 'Batman''
✓ Loaded 46 docs for community: 'Supernanny''
✓ Loaded 257 docs for community: 'Hamilton the Musical''
✓ Loaded 499 docs for community: 'Wizard of Oz''
✓ Loaded 367 docs for community: 'Jujutsu Kaisen''
✓ Loaded 195 docs for community: 'Enter the Gungeon''
✓ Loaded 499 docs for community: 'Dinosaur''
✓ Loaded 250 docs for community: 'The Karate Kid''
✓ Loaded 495 docs for community: 'Pop

In [23]:
# Create retrievers for all communities
print("Creating retrievers for all communities...")
retrievers = create_retrievers(corpuses_mapping, embedder, k=3)
print(f"Created {len(retrievers)} retrievers")

Creating retrievers for all communities...
Created 54 retrievers


#### ***NOTE: I tried generating a cooperative_question and retrieving additional context based on it, but the API's input token limit was too small and the calls raised errors.***
#### ***I also changed k to k=3 and capped each passage at max_chars_per_passage = min(10000, len(passage)).***
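
The passage cap mentioned in the note can be sketched as follows. This is an illustration only: `truncate_passages` is a hypothetical helper, not the notebook's `get_retrieved_context_maxchars` (whose definition is not shown here), and joining passages with a blank line is an assumption.

```python
from typing import List


def truncate_passages(passages: List[str], k: int = 3,
                      max_chars_per_passage: int = 10000) -> str:
    """Keep the top-k passages and cap each one at max_chars_per_passage characters."""
    kept = []
    for passage in passages[:k]:
        # Slicing keeps min(max_chars_per_passage, len(passage)) characters
        kept.append(passage[:max_chars_per_passage])
    return "\n\n".join(kept)


context = truncate_passages(["a" * 12, "b" * 3, "c" * 5, "d" * 7],
                            k=3, max_chars_per_passage=10)
print(len(context))
```

Capping by characters is only a rough proxy for the model's token limit, so the cap may need tuning per model.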


## **RAG with LLM**

In [24]:
import dspy

class LLM_RAG(dspy.Module):
    """Advanced LLM-based generative QA module using 2-turn multi-step prompting"""
    
    def __init__(self):
        super().__init__()
        
        # First turn: Generate intermediary reasoning fields
        self.intermediary_reasoning = dspy.ChainOfThought(
            "context, question, conversation_history -> student_goal, pragmatic_need",
            doc="Analyze student goals and pragmatic needs for better understanding."
        )
        
        # Second turn: Generate final answer using intermediary reasoning
        self.final_answer = dspy.ChainOfThought(
            "context, question, conversation_history, student_goal, pragmatic_need -> response",
            doc="Generate comprehensive answer based on context and intermediary reasoning analysis."
        )
    
    def forward(self, context, question, community, conversation_history=""):
        """Generate answer using 2-turn multi-step prompting"""
        try:
            # Turn 1: Generate intermediary reasoning fields
            intermediary_result = self.intermediary_reasoning(
                context=context,
                question=question,
                conversation_history=conversation_history or "No prior conversation context"
            )
            
            # Turn 2: Generate final answer using intermediary reasoning

            # additional_retrieved_context = get_retrieved_context(intermediary_result.cooperative_question, community, retrievers)
            final_result = self.final_answer(
                context=context,
                question=question,
                conversation_history=conversation_history or "No prior conversation context",
                student_goal=intermediary_result.student_goal,
                pragmatic_need=intermediary_result.pragmatic_need
                # cooperative_question_retrieved_context=additional_retrieved_context
            )
            
            # Return prediction with all fields for analysis
            return dspy.Prediction(
                response=final_result.response,
                student_goal=intermediary_result.student_goal,
                pragmatic_need=intermediary_result.pragmatic_need,
                # cooperative_question=intermediary_result.cooperative_question,
                # cooperative_question_retrieved_context=additional_retrieved_context,
                conversation_history = conversation_history,
                reasoning=f"Turn 1: {intermediary_result.rationale if hasattr(intermediary_result, 'rationale') else 'Intermediary reasoning generated'} | Turn 2: {final_result.rationale if hasattr(final_result, 'rationale') else 'Final answer reasoning applied'}"
            )
            
        except Exception as e:
            print(f"Error in LLM_RAG: {e}")
            return dspy.Prediction(
                response="I cannot provide an answer based on the given context.",
                student_goal="Unknown",
                pragmatic_need="Unable to determine",
                # cooperative_question_retrieved_context=question,
                reasoning="Error occurred during processing"
            )


llm_rag = LLM_RAG()
print("✅ LLM_RAG module created successfully!")


✅ LLM_RAG module created successfully!
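
The data flow of the two-turn design can be illustrated without a language model. In the sketch below, `two_turn_answer`, `fake_analyze` and `fake_answer` are hypothetical names; the two stub callables stand in for the two `dspy.ChainOfThought` calls, and only the way the first turn's fields are passed into the second turn mirrors `LLM_RAG.forward`.

```python
from typing import Callable, Dict


def two_turn_answer(context: str, question: str,
                    analyze: Callable[..., Dict[str, str]],
                    answer: Callable[..., str],
                    conversation_history: str = "") -> Dict[str, str]:
    """Run turn 1 (analysis), then pass its fields into turn 2 (final answer)."""
    history = conversation_history or "No prior conversation context"

    # Turn 1: infer what the asker wants beyond the literal question
    analysis = analyze(context=context, question=question,
                       conversation_history=history)

    # Turn 2: answer with the turn-1 fields as additional inputs
    response = answer(context=context, question=question,
                      conversation_history=history,
                      student_goal=analysis["student_goal"],
                      pragmatic_need=analysis["pragmatic_need"])

    return {"response": response, **analysis}


# Stub callables so the sketch runs offline
def fake_analyze(**kwargs) -> Dict[str, str]:
    return {"student_goal": "learn who the character is",
            "pragmatic_need": "short background plus a follow-up hook"}


def fake_answer(**kwargs) -> str:
    return f"[goal: {kwargs['student_goal']}] answer to: {kwargs['question']}"


result = two_turn_answer("some context", "who is freddy krueger?",
                         fake_analyze, fake_answer)
print(result["response"])
```

Injecting the two steps as callables keeps the control flow testable independently of any model or API key.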


In [24]:
test_conv = pcqa_val[0]  # First conversation in the validation set
pprint.pprint(test_conv['qas'][0])  

{'a': 'Freddy Kruger is the nightmare in nighmare on Elm street. Please note, '
      'and to be very clear, the system that loads up wiki is not allowing '
      "access to Adam Prag, to the page... so I'll have to go from memory.  "
      'Normally you can paste things and back up what you are saying, but '
      "today that's not happening. alas.",
 'a_meta': {'literal_obj': [{'endKey': None,
                             'startKey': None,
                             'text': 'Cannot GET /wiki/A%20N'}],
            'pragmatic_obj': [{'endKey': None,
                               'startKey': None,
                               'text': 'Cannot GET /wiki/A%20N'}]},
 'human_eval': ['1', '1', '1', '1', '1'],
 'q': 'who is freddy krueger?'}


In [25]:
question = test_conv['qas'][0]['q']
community = test_conv['community']
context = get_retrieved_context(question, community, retrievers)
result = llm_rag(
    context=context,
    question=question,
    community=community,
    conversation_history=""  # No prior conversation context for this test
)

In [26]:
pprint.pprint(result)

Prediction(
    response='Freddy Krueger is the main antagonist in the A Nightmare on Elm Street horror franchise, particularly highlighted in the 2010 reboot. Originally, he was a human groundskeeper at a preschool who was secretly a child molester, preying on young children in his community. After being discovered and killed by the parents of his victims in a vigilante act, Freddy transformed into a vengeful supernatural entity known as a "Dream Demon." This allows him to invade people\'s dreams and kill them in ways that manifest real-world injuries, seeking revenge on the teenagers he blames for his demise.\n\nIn the series, Freddy is significant as a horror trope representing the "undead slasher" archetype— a once-human monster who embodies repressed fears and societal guilt. Unlike some portrayals, the reboot version is more serious and realistic, focusing on his burned appearance and psychological terror rather than humor. This character explores themes like the inescapability o

### **Evaluating using dspy.Evaluate method**

In [41]:
from dspy.evaluate import Evaluate, SemanticF1
from tqdm import tqdm
print("🔄 Starting DSPy evaluation for LLM_RAG on first questions...")


semantic_f1 = SemanticF1()
# Prepare evaluation dataset for first questions in validation set
eval_examples = []
for conv in tqdm(pcqa_val, desc="Preparing first questions for LLM_RAG evaluation"):
    qa_item = conv['qas'][0]
    question = qa_item['q']
    ground_truth = qa_item['a']
    community = conv['community']
    context = get_retrieved_context_maxchars(question, community, retrievers)
    example = dspy.Example(
        context=context,
        question=question,
        community=community,
        conversation_history="",  # No prior conversation context for first turn
        response=ground_truth
    ).with_inputs('context', 'question', 'community', 'conversation_history')
    eval_examples.append(example)

# Set up the evaluator
llmrag_evaluator = Evaluate(
    devset=None,
    metric=semantic_f1,
    num_threads=24,
    display_progress=True,
    display_table=5,
)

# Run evaluation
score, outputs = llmrag_evaluator(
    program=llm_rag,
    devset=eval_examples,
    return_outputs=True
)

print(f"\n✅ LLM_RAG SemanticF1 Score: {score:.4f}")
print(f"Total examples evaluated: {len(eval_examples)}")
print(f"Sample outputs:")
for i, output in enumerate(outputs[:3]):
    print(f"Example {i+1}:")
    print(output)


🔄 Starting DSPy evaluation for LLM_RAG on first questions...


Preparing first questions for LLM_RAG evaluation: 100%|██████████| 179/179 [00:20<00:00,  8.95it/s]





Average Metric: 32.37 / 92 (35.2%):  51%|█████     | 91/179 [00:00<00:00, 272.23it/s]



Average Metric: 32.37 / 93 (34.8%):  51%|█████▏    | 92/179 [00:00<00:00, 272.23it/s]



Average Metric: 33.06 / 94 (35.2%):  52%|█████▏    | 93/179 [00:00<00:00, 272.23it/s]



Average Metric: 33.51 / 95 (35.3%):  53%|█████▎    | 94/179 [00:00<00:00, 272.23it/s]



Average Metric: 57.86 / 179 (32.3%): 100%|██████████| 179/179 [00:00<00:00, 499.82it/s]

2025/08/22 15:03:51 INFO dspy.evaluate.evaluate: Average Metric: 57.86063804484368 / 179 (32.3%)
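
The `Average Metric` figure in the log line above is the sum of the per-example SemanticF1 scores divided by the number of examples. A minimal sketch of that arithmetic, using only the totals printed in the log (the helper name `average_metric` is an assumption made for illustration):

```python
def average_metric(total_score: float, num_examples: int) -> float:
    """Mean per-example score; returns 0.0 for an empty set."""
    if num_examples == 0:
        return 0.0
    return total_score / num_examples


# Totals copied from the dspy.evaluate log line above.
avg = average_metric(57.86063804484368, 179)
print(f"{avg:.6f} ({avg * 100:.1f}%)")
```

The same mean, rounded to six decimals, is the value that appears later in the precision/recall/F1 summary.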





Unnamed: 0,context,question,community,example_conversation_history,example_response,pred_response,student_goal,pragmatic_need,pred_conversation_history,reasoning,SemanticF1
0,Freddy Krueger General information Age ? (at the time of physical ...,who is freddy krueger?,A Nightmare on Elm Street,,Freddy Kruger is the nightmare in nighmare on Elm street. Please n...,Freddy Krueger is the main antagonist in the A Nightmare on Elm St...,"To understand the character of Freddy Krueger, including his origi...",To provide a factual and engaging summary of Freddy Krueger's iden...,,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.182]
1,Johnny Depp Information Date of birth TBA Film(s) A Nightmare on E...,who was the star on this movie?,A Nightmare on Elm Street,,"Robert Englund IS Freddy Kruger, the bad guy for these films. Note...","In the movie ""A Nightmare on Elm Street,"" Johnny Depp starred as G...",To identify and understand the main actors in classic horror films...,To provide a direct and accurate identification of the movie's sta...,,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,
2,"Samuel Bayer Information Date of birth February 17, 1965 (age 47) ...",What is the movie about?,A Nightmare on Elm Street,,"Ok, here goes, I'm getting ""Cannot get""..so, Nightmare on Elm stre...","""A Nightmare on Elm Street"" is a classic horror film that centers ...","To gain a clear understanding of the plot, main characters, and co...",To provide a straightforward and accurate summary of the movie's s...,,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.545]
3,"Samuel Bayer Information Date of birth February 17, 1965 (age 47) ...",Who directed the new film?,A Nightmare on Elm Street,,It was Directed by: Samuel Bayer. Note that the link here is broke...,"The director of the 2010 film ""A Nightmare on Elm Street"" is Samue...","To identify and understand key figures in film production, such as...",To provide a straightforward and accurate response to the user's q...,,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.667]
4,Warner Bros. is the film studio that owns DC Comics and the Batman...,Is the Batman comic similar to the movies?,Batman,,"I would say the movie and comics has same story line, as Batmans p...","Yes, Batman comics and movies share many core similarities, but th...","To compare and contrast the core elements, themes, and character p...",To evaluate whether the comics offer a similar or enhanced experie...,,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.181]



✅ LLM_RAG SemanticF1 Score: 32.3200
Total examples evaluated: 179
Sample outputs:
Example 1:
(Example({'context': '\n\n\n\n\n\n\n     Freddy Krueger\n    \n\n\n      General information\n     \n\n\n       Age\n      \n\n       ? (at the time of physical death)\n      \n\n\n\n       Relationships\n      \n\n       unknown\n      \n\n\n\n       Occupation\n      \n\n       Groundskeeper (when he was alive)\n       \n        Child Molester (when he was alive)\n        Serial Killer (confirmed after death, not stated either way when he was alive)\n       \n\n\n\n\n       Appearances\n      \n\n\n        2010 reboot\n       \n\n\n\n\n       Portrayal\n      \n\n\n        Jackie Earle Haley\n       \n\n\n\n\n\n\n   This version of\n   \n    Freddy Krueger\n   \n   is the main antagonist of the\n   \n\n     A Nightmare on Elm Street\n    \n\n\n    Reboot\n   \n   . In this film, Freddy appeared once again as a Dream Demon and a serial killer after his death.\n  \n\n   In life, he was a child

In [42]:
pprint.pprint(outputs[0][1])  # Prediction for the first example
pprint.pprint(outputs[0][0])  # The first example itself (inputs and ground truth)


Prediction(
    response='Freddy Krueger is the main antagonist in the A Nightmare on Elm Street horror franchise, particularly highlighted in the 2010 reboot. Originally, he was a human groundskeeper at a preschool who was secretly a child molester, preying on young children in his community. After being discovered and killed by the parents of his victims in a vigilante act, Freddy transformed into a vengeful supernatural entity known as a "Dream Demon." This allows him to invade people\'s dreams and kill them in ways that manifest real-world injuries, seeking revenge on the teenagers he blames for his demise.\n\nIn the series, Freddy is significant as a horror trope representing the "undead slasher" archetype— a once-human monster who embodies repressed fears and societal guilt. Unlike some portrayals, the reboot version is more serious and realistic, focusing on his burned appearance and psychological terror rather than humor. This character explores themes like the inescapability o

### **Saving the results to a JSON file**

In [25]:

def save_outputs(outputs):
    """Flatten (example, prediction, score) triples into JSON-serializable dicts."""
    results = []
    for example, prediction, _score in outputs:
        output_dict = {
            "question": example["question"],
            "response": prediction["response"],
            "context": example["context"],
            "community": example["community"],
            "conversation_history": example["conversation_history"],
            "student_goal": prediction["student_goal"],
            "pragmatic_need": prediction["pragmatic_need"],
            # "cooperative_question": prediction["cooperative_question"] if "cooperative_question" in prediction else "",
            # "cooperative_question_retrieved_context": prediction["cooperative_question_retrieved_context"],
        }
        results.append(output_dict)
    return results



In [None]:
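outputs_dict = save_outputs(outputs)

import json
# Despite the .jsonl extension, this writes a single JSON array,
# so the file has to be read back with json.load (not line by line).
with open("llm_rag_outputs.jsonl", "w", encoding="utf-8") as f:
    json.dump(outputs_dict, f, indent=2, ensure_ascii=False)

print("✅ Saved outputs to llm_rag_outputs.jsonl")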
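
The cell above writes one JSON array with `json.dump`, which is why `json.load` can read it back even though the file carries a `.jsonl` extension. If a true JSON Lines file (one object per line) is wanted instead, a minimal sketch could look like the following. `write_jsonl`, `read_jsonl`, and the demo path are hypothetical names, not part of the notebook:

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip demo with placeholder records in a temporary directory.
demo_records = [
    {"question": "q1", "response": "r1"},
    {"question": "q2", "response": "r2"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_outputs.jsonl")
    write_jsonl(demo_records, demo_path)
    restored = read_jsonl(demo_path)

print(restored == demo_records)
```

A file written this way can be read line by line, in the same manner as `read_data` at the top of the notebook.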

In [45]:
import json

with open("llm_rag_outputs.jsonl", "r", encoding="utf-8") as f:
    loaded_outputs = json.load(f)

print(f"Loaded {len(loaded_outputs)} outputs from llm_rag_outputs.jsonl")
# Optionally, print the first output for inspection
print(loaded_outputs[0]["question"])

Loaded 179 outputs from llm_rag_outputs.jsonl
who is freddy krueger?


In [46]:
# Split the (example, prediction, score) triples into two parallel lists.
predictions = []
gt_list = []
for gt_example, pred_example, _ in outputs:
    print(f"GROUND TRUTH : {gt_example['response']}")
    print(f"PREDICTION : {pred_example['response']}")
    print("\n")
    predictions.append(pred_example)
    gt_list.append(gt_example)

GROUND TRUTH : Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.
PREDICTION : Freddy Krueger is the main antagonist in the A Nightmare on Elm Street horror franchise, particularly highlighted in the 2010 reboot. Originally, he was a human groundskeeper at a preschool who was secretly a child molester, preying on young children in his community. After being discovered and killed by the parents of his victims in a vigilante act, Freddy transformed into a vengeful supernatural entity known as a "Dream Demon." This allows him to invade people's dreams and kill them in ways that manifest real-world injuries, seeking revenge on the teenagers he blames for his demise.

In the series, Freddy is significant as a horror trope representing th

### **Evaluating using the SemanticF1 method**
#### *NOTE: `dspy.Evaluate` reports only the average score and does not give precision and recall separately, so I obtained them by calling SemanticF1 directly. Because the same examples had already been scored through `dspy.Evaluate`, the results were cached and this step finished almost instantly.*

In [47]:
results = evaluate_results(gt_list, predictions)

Evaluating examples: 100%|██████████| 179/179 [00:00<00:00, 1944.73it/s]


In [48]:
import pandas as pd

df = pd.DataFrame([results])
display(df.style.hide(axis="index"))

| precision | recall | f1 | count |
|---|---|---|---|
| 0.323244 | 0.323244 | 0.323244 | 179 |
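
All three score columns above carry the same value. For reference, F1 is conventionally the harmonic mean of precision and recall, so it equals both whenever they are equal. A small sketch of that relationship follows; `harmonic_f1` is an illustrative helper and is not the notebook's `evaluate_results`:

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Equal precision and recall give the same F1.
print(round(harmonic_f1(0.323244, 0.323244), 6))
# Unequal values pull F1 toward the smaller of the two.
print(round(harmonic_f1(0.9, 0.1), 2))
```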


### **Comparing results with the DistilBERT model on the first questions**

In [26]:
# Compare the results with the DistilBERT model on the first questions
import json

# Load the DistilBERT results
with open("evaluation_results_distillbert.json", "r", encoding="utf-8") as f:
    distilbert_results = json.load(f)
# Load the llm_rag results
with open("llm_rag_outputs.jsonl", 'r', encoding='utf-8') as f:
    llm_rag_results = json.load(f)

gt_examples = distilbert_results["configurations"]["retrieved_context"]["ground_truth_examples"]
litteral_distillbert_pred_examples = distilbert_results["configurations"]["literal_context"]["predicted_examples"]
pragmatic_distillbert_pred_examples = distilbert_results["configurations"]["pragmatic_context"]["predicted_examples"]
retrieved_distillbert_pred_examples = distilbert_results["configurations"]["retrieved_context"]["predicted_examples"]

for i in range(len(gt_examples)):
    print(f"Question {i+1}:")
    print(f"  Question: {gt_examples[i]['question']}")
    print(f"  Ground Truth Answer: {gt_examples[i]['response']}")
    print(f"  Literal DistillBERT Predicted Answer: {litteral_distillbert_pred_examples[i]['response']}")
    print(f"  Pragmatic DistillBERT Predicted Answer: {pragmatic_distillbert_pred_examples[i]['response']}")
    print(f"  Retrieved DistillBERT Predicted Answer: {retrieved_distillbert_pred_examples[i]['response']}")
    print(f"  LLM_RAG Predicted Answer: {llm_rag_results[i]['response']}")
    print("=" * 40)
print(f"\nDisplaying first {len(gt_examples)} examples from the loaded results:\n")



Question 1:
  Question: who is freddy krueger?
  Ground Truth Answer: Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.
  Literal DistillBERT Predicted Answer: Cannot GET /wiki/A%20N
  Pragmatic DistillBERT Predicted Answer: Cannot GET /wiki/A%20N
  Retrieved DistillBERT Predicted Answer: Trivia
   



    Billy Bob Thornton
  LLM_RAG Predicted Answer: Freddy Krueger is the main antagonist in the A Nightmare on Elm Street horror franchise, particularly highlighted in the 2010 reboot. Originally, he was a human groundskeeper at a preschool who was secretly a child molester, preying on young children in his community. After being discovered and killed by the parents of his victims in a vigilante act, Freddy transformed into a vengefu

#### **Configuration: Literal Context**

-    ***Precision: 0.4287***
-    ***Recall:    0.4287***
-    ***F1:        0.4287***
-    ***Count:     179***

#### **Configuration: Pragmatic Context**

-    ***Precision: 0.3092***
-    ***Recall:    0.3092***
-    ***F1:        0.3092***
-    ***Count:     179***

#### **Configuration: Retrieved Context**

-    ***Precision: 0.0846***
-    ***Recall:    0.0846***
-    ***F1:        0.0846***
-    ***Count:     179***

#### **Configuration: RAG with LLM**

-    ***Precision: 0.323244***
-    ***Recall:    0.323244***
-    ***F1:        0.323244***
-    ***Count:     179***
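
To compare the four configurations side by side, the F1 values reported above can be collected and sorted. This is only a presentation sketch: the dictionary keys are labels chosen here, and the numbers are copied from the lists above.

```python
# F1 scores copied from the configuration summaries above (179 examples each).
f1_by_config = {
    "DistilBERT, literal context": 0.4287,
    "DistilBERT, pragmatic context": 0.3092,
    "DistilBERT, retrieved context": 0.0846,
    "RAG with LLM": 0.323244,
}

# Sort from highest to lowest F1.
ranked = sorted(f1_by_config.items(), key=lambda item: item[1], reverse=True)

for name, f1 in ranked:
    print(f"{name:<32} {f1:.4f}")
```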

### **Analyzing the results**
##### 1.    *The LLM-based RAG approach gave more pragmatic answers than DistilBERT with the retriever in all of the answers.*

##### 2.    *From my personal observation of the answers, the LLM-based RAG approach gave more complete answers, with more pragmatic information, than the original ground-truth answers.*

##### 3.    *From my analysis of the answers, the LLM-based RAG approach also gave more complete answers, both literally and pragmatically, than DistilBERT with literal or pragmatic spans as context, even though the literal-context configuration scores higher on SemanticF1 (0.4287 vs. 0.3232).*

##### 4. *The LLM-based RAG approach answers several possible follow-up questions, but these may not be the exact follow-up questions that the ground-truth pragmatic answers aim to address. They are still valid follow-up questions for the conversation.*

# **4.4.2**

### **Creating examples with conversation history (using all of val.jsonl)**

In [28]:
from dspy.evaluate import Evaluate, SemanticF1
from tqdm import tqdm
print("🔄 Starting DSPy evaluation for LLM_RAG on first questions...")

def create_samples(input_qa_set):
    """Build one dspy.Example per turn, with all earlier turns as conversation history."""
    eval_examples = []
    for conv in tqdm(input_qa_set, desc="Preparing first questions for LLM_RAG evaluation"):
        community = conv['community']
        conversation_history = []  # Earlier turns of this conversation
        for qa_item in conv['qas']:
            question = qa_item['q']
            ground_truth = qa_item['a']
            context = get_retrieved_context_maxchars(question, community, retrievers)
            if conversation_history:
                history_text = " || ".join(f"Q: {ch['Q']} A: {ch['A']}" for ch in conversation_history)
            else:
                history_text = "No prior conversation context"
            example = dspy.Example(
                context=context,
                question=question,
                community=community,
                conversation_history=history_text,
                response=ground_truth
            ).with_inputs('context', 'question', 'community', 'conversation_history')
            eval_examples.append(example)
            # Add the current turn only after its example is built,
            # so an answer never appears in its own history.
            conversation_history.append({"Q": question, "A": ground_truth})
    return eval_examples

eval_examples = create_samples(pcqa_val)
print(f"\n📊 Total examples for multi-turn evaluation: {len(eval_examples)}")

🔄 Starting DSPy evaluation for LLM_RAG on first questions...


Preparing first questions for LLM_RAG evaluation: 100%|██████████| 179/179 [02:53<00:00,  1.03it/s]


📊 Total examples for multi-turn evaluation: 1526
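
The history string built inside `create_samples` grows by one `Q: ... A: ...` segment per turn, so each example only sees the turns that precede it. A dependency-free sketch of that accumulation is shown here; `history_per_turn` is a hypothetical helper that mirrors the join logic without the retriever or `dspy.Example`:

```python
NO_HISTORY = "No prior conversation context"


def history_per_turn(qas):
    """Return, for each turn, the history string of all earlier turns."""
    histories = []
    previous_turns = []
    for qa in qas:
        if previous_turns:
            histories.append(
                " || ".join(f"Q: {t['q']} A: {t['a']}" for t in previous_turns)
            )
        else:
            histories.append(NO_HISTORY)
        # The current turn joins the history only after its own entry is built.
        previous_turns.append(qa)
    return histories


# Toy conversation with placeholder text.
toy_qas = [
    {"q": "first question", "a": "first answer"},
    {"q": "second question", "a": "second answer"},
    {"q": "third question", "a": "third answer"},
]
for history in history_per_turn(toy_qas):
    print(history)
```

A conversation with n turns therefore yields n examples, which is how 179 validation conversations expand into 1526 examples.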





In [29]:
pprint.pprint(eval_examples[0]["conversation_history"])
print("=" * 80)
pprint.pprint(eval_examples[1]["conversation_history"])
print("=" * 80)
pprint.pprint(eval_examples[2]["conversation_history"])
print("=" * 80)
pprint.pprint(eval_examples[3]["conversation_history"])

'No prior conversation context'
('Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nighmare on '
 'Elm street. Please note, and to be very clear, the system that loads up wiki '
 "is not allowing access to Adam Prag, to the page... so I'll have to go from "
 'memory.  Normally you can paste things and back up what you are saying, but '
 "today that's not happening. alas.")
('Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nighmare on '
 'Elm street. Please note, and to be very clear, the system that loads up wiki '
 "is not allowing access to Adam Prag, to the page... so I'll have to go from "
 'memory.  Normally you can paste things and back up what you are saying, but '
 "today that's not happening. alas. || Q: oh man, that sucks. A: Yes and no, "
 'it means I can be lighting quick, especially since I type quickly, and it '
 "means you'll make a higher hourly. Let's get cash. lol.")
('Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nighmare

In [30]:
compile_examples = create_samples(pcqa_train[:4])  # Use only first 4 conversations from training set for compilation
print(f"\n📊 Total examples for compilation: {len(compile_examples)}")

Preparing first questions for LLM_RAG evaluation: 100%|██████████| 4/4 [00:02<00:00,  1.53it/s]


📊 Total examples for compilation: 22





In [31]:
pprint.pprint(compile_examples[0]["conversation_history"])
print("=" * 80)
pprint.pprint(compile_examples[1]["conversation_history"])
print("=" * 80)
pprint.pprint(compile_examples[2]["conversation_history"])
print("=" * 80)


'No prior conversation context'
('Q: how old is snoop dogg? A: Born in California on October 20, 1971, Snoop '
 'Dog is now 46 years old')
('Q: how old is snoop dogg? A: Born in California on October 20, 1971, Snoop '
 'Dog is now 46 years old || Q: I see.. has he ever been incarcerated? A: Yes, '
 'after graduating from high school Snoop Dogg was arrested for narcotics '
 'possession and spent the next 3 years in and out of prison. As a teenager he '
 'frequently ran into trouble with the law.')


### **Optimizing the RAG with LLM using BootstrapFewShot**

In [None]:
import dspy
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(
    metric=semantic_f1,
    max_bootstrapped_demos=8,
    max_labeled_demos=16,
)

optimized_llm_rag = optimizer.compile(llm_rag, trainset=compile_examples)
print("✅ Optimization completed successfully!")

100%|██████████| 22/22 [00:00<00:00, 195.48it/s]

Bootstrapped 5 full traces after 21 examples for up to 1 rounds, amounting to 22 attempts.
✅ Optimization completed successfully!
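
Conceptually, bootstrapping few-shot demos means running the program on training examples and keeping the ones whose output passes the metric, up to a cap. The sketch below illustrates that selection loop only. It is not DSPy's implementation, and `bootstrap_demos`, `toy_program`, and `exact_match` are hypothetical stand-ins:

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def bootstrap_demos(
    program: Callable[[str], str],
    metric: Callable[[str, str], bool],
    trainset: List[Example],
    max_demos: int,
) -> List[Example]:
    """Keep training examples whose prediction passes the metric, up to max_demos."""
    demos: List[Example] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example["response"], prediction):
            demos.append({**example, "prediction": prediction})
    return demos


def toy_program(question: str) -> str:
    """Stand-in for an LLM call: upper-cases the question."""
    return question.upper()


def exact_match(gold: str, prediction: str) -> bool:
    """Stand-in metric: passes only on an exact string match."""
    return gold == prediction


toy_trainset = [
    {"question": "a", "response": "A"},
    {"question": "b", "response": "x"},
    {"question": "c", "response": "C"},
]
selected = bootstrap_demos(toy_program, exact_match, toy_trainset, max_demos=8)
print([d["question"] for d in selected])
```

In the run above, the log reports 5 bootstrapped traces, fewer than the `max_bootstrapped_demos=8` cap.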





### **Why this metric?**
- ##### *I used the SemanticF1 metric to compile and optimize the program because it scores how well a predicted answer captures the meaning of the reference answer rather than its exact wording. This matters in pragmatic question answering, where the goal is to respond to the intent behind a question, which a literal answer does not always reflect.*

### **Evaluating the model**

In [33]:
semantic_f1 = SemanticF1()

llmrag_evaluator = Evaluate(
    devset=None,
    metric=semantic_f1,
    num_threads=24,
    display_progress=True,
    display_table=10,
)

# Run evaluation
score, outputs = llmrag_evaluator(
    program=llm_rag,
    devset=eval_examples,
    return_outputs=True
)



  0%|          | 0/1526 [00:00<?, ?it/s]






Average Metric: 67.81 / 221 (30.7%):  14%|█▍        | 220/1526 [00:01<00:10, 124.25it/s]



Average Metric: 68.36 / 223 (30.7%):  15%|█▍        | 222/1526 [00:01<00:10, 124.25it/s]



Average Metric: 68.81 / 224 (30.7%):  15%|█▍        | 223/1526 [00:01<00:09, 140.10it/s]



Average Metric: 69.18 / 225 (30.7%):  15%|█▍        | 224/1526 [00:01<00:09, 140.10it/s]



Average Metric: 70.42 / 227 (31.0%):  15%|█▍        | 226/1526 [00:01<00:09, 140.10it/s]



Average Metric: 70.97 / 228 (31.1%):  15%|█▍        | 227/1526 [00:01<00:09, 140.10it/s]



Average Metric: 71.11 / 229 (31.1%):  15%|█▍        | 228/1526 [00:01<00:09, 140.10it/s]



Average Metric: 71.31 / 230 (31.0%):  15%|█▌        | 229/1526 [00:01<00:09, 140.10it/s]



Average Metric: 71.76 / 232 (30.9%):  15%|█▌        | 231/1526 [00:01<00:09, 140.10it/s]



Average Metric: 72.32 / 233 (31.0%):  15%|█▌        | 232/1526 [00:01<00:09, 140.10it/s]



Average Metric: 72.81 / 235 (31.0%):  15%|█▌        | 234/1526 [00:01<00:09, 140.10it/s]



Average Metric: 73.38 / 236 (31.1%):  15%|█▌        | 235/1526 [00:01<00:09, 140.10it/s]



Average Metric: 74.38 / 237 (31.4%):  15%|█▌        | 236/1526 [00:01<00:09, 140.10it/s]



Average Metric: 74.63 / 238 (31.4%):  16%|█▌        | 237/1526 [00:01<00:09, 140.10it/s]



Average Metric: 74.63 / 239 (31.2%):  16%|█▌        | 238/1526 [00:01<00:09, 140.10it/s]



Average Metric: 75.13 / 240 (31.3%):  16%|█▌        | 239/1526 [00:01<00:09, 140.10it/s]



Average Metric: 75.50 / 241 (31.3%):  16%|█▌        | 240/1526 [00:01<00:09, 140.10it/s]



Average Metric: 75.50 / 243 (31.1%):  16%|█▌        | 242/1526 [00:01<00:09, 140.10it/s]



Average Metric: 75.90 / 245 (31.0%):  16%|█▌        | 244/1526 [00:01<00:09, 140.10it/s]



Average Metric: 76.18 / 246 (31.0%):  16%|█▌        | 245/1526 [00:01<00:09, 140.10it/s]



Average Metric: 76.18 / 246 (31.0%):  16%|█▌        | 246/1526 [00:01<00:07, 160.81it/s]



Average Metric: 76.18 / 248 (30.7%):  16%|█▌        | 247/1526 [00:01<00:07, 160.81it/s]



Average Metric: 76.58 / 249 (30.8%):  16%|█▋        | 248/1526 [00:01<00:07, 160.81it/s]



Average Metric: 76.87 / 250 (30.7%):  16%|█▋        | 249/1526 [00:01<00:07, 160.81it/s]



Average Metric: 77.62 / 251 (30.9%):  16%|█▋        | 250/1526 [00:01<00:07, 160.81it/s]



Average Metric: 78.47 / 253 (31.0%):  17%|█▋        | 252/1526 [00:01<00:07, 160.81it/s]



Average Metric: 78.69 / 254 (31.0%):  17%|█▋        | 253/1526 [00:01<00:07, 160.81it/s]



Average Metric: 78.89 / 255 (30.9%):  17%|█▋        | 254/1526 [00:01<00:07, 160.81it/s]



Average Metric: 79.47 / 256 (31.0%):  17%|█▋        | 255/1526 [00:01<00:07, 160.81it/s]



Average Metric: 79.83 / 258 (30.9%):  17%|█▋        | 257/1526 [00:01<00:07, 160.81it/s]



Average Metric: 80.12 / 259 (30.9%):  17%|█▋        | 258/1526 [00:01<00:07, 160.81it/s]



Average Metric: 80.52 / 261 (30.8%):  17%|█▋        | 260/1526 [00:01<00:07, 160.81it/s]



Average Metric: 80.81 / 262 (30.8%):  17%|█▋        | 261/1526 [00:01<00:07, 160.81it/s]



Average Metric: 81.95 / 264 (31.0%):  17%|█▋        | 263/1526 [00:01<00:07, 160.81it/s]



Average Metric: 81.95 / 264 (31.0%):  17%|█▋        | 264/1526 [00:01<00:07, 162.61it/s]



Average Metric: 81.95 / 265 (30.9%):  17%|█▋        | 264/1526 [00:01<00:07, 162.61it/s]



Average Metric: 82.30 / 266 (30.9%):  17%|█▋        | 265/1526 [00:01<00:07, 162.61it/s]



Average Metric: 82.58 / 268 (30.8%):  17%|█▋        | 267/1526 [00:01<00:07, 162.61it/s]



Average Metric: 82.92 / 269 (30.8%):  18%|█▊        | 268/1526 [00:01<00:07, 162.61it/s]



Average Metric: 83.25 / 270 (30.8%):  18%|█▊        | 269/1526 [00:01<00:07, 162.61it/s]



Average Metric: 84.17 / 272 (30.9%):  18%|█▊        | 271/1526 [00:01<00:07, 162.61it/s]



Average Metric: 84.17 / 273 (30.8%):  18%|█▊        | 272/1526 [00:01<00:07, 162.61it/s]



Average Metric: 84.78 / 274 (30.9%):  18%|█▊        | 273/1526 [00:01<00:07, 162.61it/s]



Average Metric: 85.53 / 276 (31.0%):  18%|█▊        | 275/1526 [00:01<00:07, 162.61it/s]



Average Metric: 85.86 / 277 (31.0%):  18%|█▊        | 276/1526 [00:01<00:07, 162.61it/s]



Average Metric: 85.86 / 278 (30.9%):  18%|█▊        | 277/1526 [00:01<00:07, 162.61it/s]



Average Metric: 86.26 / 279 (30.9%):  18%|█▊        | 278/1526 [00:01<00:07, 162.61it/s]



Average Metric: 86.77 / 280 (31.0%):  18%|█▊        | 279/1526 [00:01<00:07, 162.61it/s]



Average Metric: 87.10 / 281 (31.0%):  18%|█▊        | 280/1526 [00:01<00:07, 162.61it/s]



Average Metric: 87.72 / 282 (31.1%):  18%|█▊        | 281/1526 [00:01<00:07, 162.61it/s]



Average Metric: 87.72 / 282 (31.1%):  18%|█▊        | 282/1526 [00:01<00:08, 138.88it/s]



Average Metric: 87.72 / 284 (30.9%):  19%|█▊        | 283/1526 [00:01<00:08, 138.88it/s]



Average Metric: 88.52 / 285 (31.1%):  19%|█▊        | 284/1526 [00:01<00:08, 138.88it/s]



Average Metric: 89.08 / 287 (31.0%):  19%|█▊        | 286/1526 [00:01<00:08, 138.88it/s]



Average Metric: 89.33 / 288 (31.0%):  19%|█▉        | 287/1526 [00:01<00:08, 138.88it/s]



Average Metric: 89.93 / 289 (31.1%):  19%|█▉        | 288/1526 [00:01<00:08, 138.88it/s]



Average Metric: 90.54 / 291 (31.1%):  19%|█▉        | 290/1526 [00:01<00:08, 138.88it/s]



Average Metric: 91.21 / 292 (31.2%):  19%|█▉        | 291/1526 [00:01<00:08, 138.88it/s]



Average Metric: 91.21 / 293 (31.1%):  19%|█▉        | 292/1526 [00:01<00:08, 138.88it/s]



Average Metric: 92.06 / 295 (31.2%):  19%|█▉        | 294/1526 [00:01<00:08, 138.88it/s]



Average Metric: 92.53 / 296 (31.3%):  19%|█▉        | 295/1526 [00:02<00:08, 138.88it/s]



Average Metric: 93.23 / 297 (31.4%):  19%|█▉        | 296/1526 [00:02<00:08, 138.88it/s]



Average Metric: 93.61 / 298 (31.4%):  19%|█▉        | 297/1526 [00:02<00:08, 138.88it/s]



Average Metric: 95.02 / 303 (31.4%):  20%|█▉        | 302/1526 [00:02<00:10, 116.60it/s]



Average Metric: 96.02 / 306 (31.4%):  20%|█▉        | 305/1526 [00:02<00:10, 116.60it/s]



Average Metric: 96.42 / 308 (31.3%):  20%|██        | 307/1526 [00:02<00:10, 116.60it/s]



Average Metric: 96.42 / 309 (31.2%):  20%|██        | 308/1526 [00:02<00:10, 116.60it/s]



Average Metric: 96.82 / 310 (31.2%):  20%|██        | 309/1526 [00:02<00:10, 116.60it/s]



Average Metric: 96.82 / 311 (31.1%):  20%|██        | 310/1526 [00:02<00:10, 116.60it/s]



Average Metric: 97.07 / 312 (31.1%):  20%|██        | 311/1526 [00:02<00:10, 116.60it/s]



Average Metric: 97.07 / 312 (31.1%):  20%|██        | 312/1526 [00:02<00:12, 100.20it/s]



Average Metric: 97.57 / 313 (31.2%):  20%|██        | 312/1526 [00:02<00:12, 100.20it/s]



Average Metric: 98.07 / 314 (31.2%):  21%|██        | 313/1526 [00:02<00:12, 100.20it/s]



Average Metric: 98.57 / 315 (31.3%):  21%|██        | 314/1526 [00:02<00:12, 100.20it/s]



Average Metric: 98.57 / 316 (31.2%):  21%|██        | 315/1526 [00:02<00:12, 100.20it/s]



Average Metric: 98.57 / 317 (31.1%):  21%|██        | 316/1526 [00:02<00:12, 100.20it/s]



Average Metric: 99.23 / 318 (31.2%):  21%|██        | 317/1526 [00:02<00:12, 100.20it/s]



Average Metric: 99.95 / 320 (31.2%):  21%|██        | 319/1526 [00:02<00:12, 100.20it/s]



Average Metric: 99.95 / 321 (31.1%):  21%|██        | 320/1526 [00:02<00:12, 100.20it/s]



Average Metric: 100.61 / 322 (31.2%):  21%|██        | 321/1526 [00:02<00:12, 100.20it/s]



Average Metric: 101.06 / 324 (31.2%):  21%|██        | 323/1526 [00:02<00:12, 100.20it/s]



Average Metric: 101.06 / 324 (31.2%):  21%|██        | 324/1526 [00:02<00:12, 99.76it/s] 



Average Metric: 101.83 / 325 (31.3%):  21%|██        | 324/1526 [00:02<00:12, 99.76it/s]



Average Metric: 102.11 / 327 (31.2%):  21%|██▏       | 326/1526 [00:02<00:12, 99.76it/s]



Average Metric: 102.58 / 328 (31.3%):  21%|██▏       | 327/1526 [00:02<00:12, 99.76it/s]



Average Metric: 102.58 / 329 (31.2%):  21%|██▏       | 328/1526 [00:02<00:12, 99.76it/s]



Average Metric: 103.32 / 331 (31.2%):  22%|██▏       | 330/1526 [00:02<00:11, 99.76it/s]



Average Metric: 103.32 / 332 (31.1%):  22%|██▏       | 331/1526 [00:02<00:11, 99.76it/s]



Average Metric: 103.76 / 333 (31.2%):  22%|██▏       | 332/1526 [00:02<00:11, 99.76it/s]



Average Metric: 104.32 / 335 (31.1%):  22%|██▏       | 334/1526 [00:02<00:11, 99.76it/s]



Average Metric: 104.32 / 337 (31.0%):  22%|██▏       | 336/1526 [00:02<00:11, 99.76it/s]



Average Metric: 104.80 / 338 (31.0%):  22%|██▏       | 337/1526 [00:02<00:11, 99.76it/s]



Average Metric: 104.80 / 339 (30.9%):  22%|██▏       | 338/1526 [00:02<00:11, 99.76it/s]



Average Metric: 104.80 / 339 (30.9%):  22%|██▏       | 339/1526 [00:02<00:10, 110.32it/s]



Average Metric: 105.02 / 340 (30.9%):  22%|██▏       | 339/1526 [00:02<00:10, 110.32it/s]



Average Metric: 105.52 / 342 (30.9%):  22%|██▏       | 341/1526 [00:02<00:10, 110.32it/s]



Average Metric: 105.75 / 344 (30.7%):  22%|██▏       | 343/1526 [00:02<00:10, 110.32it/s]



Average Metric: 106.09 / 345 (30.7%):  23%|██▎       | 344/1526 [00:02<00:10, 110.32it/s]



Average Metric: 106.66 / 346 (30.8%):  23%|██▎       | 345/1526 [00:02<00:10, 110.32it/s]



Average Metric: 107.38 / 348 (30.9%):  23%|██▎       | 347/1526 [00:02<00:10, 110.32it/s]



Average Metric: 107.38 / 349 (30.8%):  23%|██▎       | 348/1526 [00:02<00:10, 110.32it/s]



Average Metric: 107.69 / 350 (30.8%):  23%|██▎       | 349/1526 [00:02<00:10, 110.32it/s]



Average Metric: 108.02 / 351 (30.8%):  23%|██▎       | 350/1526 [00:02<00:10, 110.32it/s]



Average Metric: 108.27 / 353 (30.7%):  23%|██▎       | 352/1526 [00:02<00:10, 110.32it/s]



Average Metric: 108.27 / 354 (30.6%):  23%|██▎       | 353/1526 [00:02<00:10, 110.32it/s]



Average Metric: 108.60 / 355 (30.6%):  23%|██▎       | 354/1526 [00:02<00:10, 110.32it/s]



Average Metric: 108.89 / 356 (30.6%):  23%|██▎       | 356/1526 [00:02<00:09, 124.29it/s]



Average Metric: 108.89 / 357 (30.5%):  23%|██▎       | 356/1526 [00:02<00:09, 124.29it/s]



Average Metric: 109.55 / 359 (30.5%):  23%|██▎       | 358/1526 [00:02<00:09, 124.29it/s]



Average Metric: 109.55 / 360 (30.4%):  24%|██▎       | 359/1526 [00:02<00:09, 124.29it/s]



Average Metric: 109.84 / 361 (30.4%):  24%|██▎       | 360/1526 [00:02<00:09, 124.29it/s]



Average Metric: 109.84 / 362 (30.3%):  24%|██▎       | 361/1526 [00:02<00:09, 124.29it/s]



Average Metric: 110.41 / 363 (30.4%):  24%|██▎       | 362/1526 [00:02<00:09, 124.29it/s]



Average Metric: 110.79 / 365 (30.4%):  24%|██▍       | 364/1526 [00:02<00:09, 124.29it/s]



Average Metric: 111.32 / 367 (30.3%):  24%|██▍       | 366/1526 [00:02<00:09, 124.29it/s]



Average Metric: 111.57 / 368 (30.3%):  24%|██▍       | 367/1526 [00:02<00:09, 124.29it/s]



Average Metric: 112.15 / 370 (30.3%):  24%|██▍       | 369/1526 [00:02<00:09, 124.29it/s]



Average Metric: 112.77 / 371 (30.4%):  24%|██▍       | 370/1526 [00:02<00:09, 124.29it/s]



Average Metric: 112.77 / 372 (30.3%):  24%|██▍       | 371/1526 [00:02<00:09, 124.29it/s]



Average Metric: 113.05 / 373 (30.3%):  24%|██▍       | 372/1526 [00:02<00:09, 124.29it/s]



Average Metric: 114.00 / 375 (30.4%):  25%|██▍       | 374/1526 [00:02<00:09, 124.29it/s]



Average Metric: 115.07 / 377 (30.5%):  25%|██▍       | 376/1526 [00:02<00:09, 124.29it/s]



Average Metric: 115.29 / 378 (30.5%):  25%|██▍       | 377/1526 [00:02<00:09, 124.29it/s]



Average Metric: 115.82 / 379 (30.6%):  25%|██▍       | 379/1526 [00:02<00:07, 150.54it/s]



Average Metric: 116.22 / 380 (30.6%):  25%|██▍       | 379/1526 [00:02<00:07, 150.54it/s]



Average Metric: 116.22 / 381 (30.5%):  25%|██▍       | 380/1526 [00:02<00:07, 150.54it/s]



Average Metric: 116.44 / 383 (30.4%):  25%|██▌       | 382/1526 [00:02<00:07, 150.54it/s]



Average Metric: 116.44 / 384 (30.3%):  25%|██▌       | 383/1526 [00:02<00:07, 150.54it/s]



Average Metric: 116.81 / 385 (30.3%):  25%|██▌       | 384/1526 [00:02<00:07, 150.54it/s]



Average Metric: 117.17 / 386 (30.4%):  25%|██▌       | 385/1526 [00:02<00:07, 150.54it/s]



Average Metric: 117.92 / 387 (30.5%):  25%|██▌       | 386/1526 [00:02<00:07, 150.54it/s]



Average Metric: 118.20 / 388 (30.5%):  25%|██▌       | 387/1526 [00:02<00:07, 150.54it/s]



Average Metric: 118.87 / 390 (30.5%):  25%|██▌       | 389/1526 [00:02<00:07, 150.54it/s]



Average Metric: 119.67 / 392 (30.5%):  26%|██▌       | 391/1526 [00:02<00:07, 150.54it/s]



Average Metric: 120.17 / 394 (30.5%):  26%|██▌       | 393/1526 [00:02<00:07, 150.54it/s]



Average Metric: 120.67 / 395 (30.6%):  26%|██▌       | 394/1526 [00:02<00:07, 150.54it/s]



Average Metric: 120.67 / 396 (30.5%):  26%|██▌       | 395/1526 [00:02<00:07, 150.54it/s]



Average Metric: 121.58 / 398 (30.5%):  26%|██▌       | 397/1526 [00:02<00:07, 150.54it/s]



Average Metric: 121.78 / 400 (30.4%):  26%|██▌       | 399/1526 [00:02<00:07, 150.54it/s]



Average Metric: 122.32 / 402 (30.4%):  26%|██▋       | 401/1526 [00:02<00:07, 150.54it/s]



Average Metric: 122.84 / 404 (30.4%):  26%|██▋       | 403/1526 [00:02<00:07, 150.54it/s]



Average Metric: 122.84 / 405 (30.3%):  26%|██▋       | 404/1526 [00:02<00:07, 150.54it/s]



Average Metric: 123.17 / 406 (30.3%):  27%|██▋       | 405/1526 [00:02<00:06, 178.80it/s]



Average Metric: 123.60 / 408 (30.3%):  27%|██▋       | 407/1526 [00:02<00:06, 178.80it/s]



Average Metric: 123.60 / 409 (30.2%):  27%|██▋       | 408/1526 [00:02<00:06, 178.80it/s]



Average Metric: 123.81 / 410 (30.2%):  27%|██▋       | 409/1526 [00:02<00:06, 178.80it/s]



Average Metric: 123.81 / 411 (30.1%):  27%|██▋       | 410/1526 [00:02<00:06, 178.80it/s]



Average Metric: 123.81 / 412 (30.1%):  27%|██▋       | 411/1526 [00:02<00:06, 178.80it/s]



Average Metric: 123.81 / 413 (30.0%):  27%|██▋       | 412/1526 [00:02<00:06, 178.80it/s]



Average Metric: 124.38 / 414 (30.0%):  27%|██▋       | 413/1526 [00:02<00:06, 178.80it/s]



Average Metric: 124.38 / 415 (30.0%):  27%|██▋       | 414/1526 [00:02<00:06, 178.80it/s]



Average Metric: 124.67 / 416 (30.0%):  27%|██▋       | 415/1526 [00:02<00:06, 178.80it/s]



Average Metric: 125.34 / 418 (30.0%):  27%|██▋       | 417/1526 [00:02<00:06, 178.80it/s]



Average Metric: 125.74 / 419 (30.0%):  27%|██▋       | 418/1526 [00:02<00:06, 178.80it/s]



Average Metric: 126.04 / 421 (29.9%):  28%|██▊       | 420/1526 [00:02<00:06, 178.80it/s]



Average Metric: 126.27 / 422 (29.9%):  28%|██▊       | 421/1526 [00:02<00:06, 178.80it/s]



Average Metric: 126.27 / 423 (29.9%):  28%|██▊       | 422/1526 [00:02<00:06, 178.80it/s]



Average Metric: 126.60 / 424 (29.9%):  28%|██▊       | 423/1526 [00:02<00:06, 178.80it/s]



Average Metric: 126.77 / 425 (29.8%):  28%|██▊       | 425/1526 [00:02<00:06, 178.14it/s]



Average Metric: 127.41 / 427 (29.8%):  28%|██▊       | 426/1526 [00:02<00:06, 178.14it/s]



Average Metric: 128.41 / 429 (29.9%):  28%|██▊       | 428/1526 [00:02<00:06, 178.14it/s]



Average Metric: 128.63 / 430 (29.9%):  28%|██▊       | 429/1526 [00:02<00:06, 178.14it/s]



Average Metric: 128.63 / 431 (29.8%):  28%|██▊       | 430/1526 [00:02<00:06, 178.14it/s]



Average Metric: 129.32 / 433 (29.9%):  28%|██▊       | 432/1526 [00:02<00:06, 178.14it/s]



Average Metric: 129.61 / 434 (29.9%):  28%|██▊       | 433/1526 [00:02<00:06, 178.14it/s]



Average Metric: 129.61 / 435 (29.8%):  28%|██▊       | 434/1526 [00:02<00:06, 178.14it/s]



Average Metric: 129.61 / 436 (29.7%):  29%|██▊       | 435/1526 [00:02<00:06, 178.14it/s]



Average Metric: 129.61 / 437 (29.7%):  29%|██▊       | 436/1526 [00:02<00:06, 178.14it/s]



Average Metric: 129.61 / 438 (29.6%):  29%|██▊       | 437/1526 [00:02<00:06, 178.14it/s]



Average Metric: 130.28 / 440 (29.6%):  29%|██▉       | 439/1526 [00:02<00:06, 178.14it/s]



Average Metric: 131.01 / 441 (29.7%):  29%|██▉       | 440/1526 [00:02<00:06, 178.14it/s]



Average Metric: 131.01 / 442 (29.6%):  29%|██▉       | 441/1526 [00:02<00:06, 178.14it/s]



Average Metric: 131.01 / 443 (29.6%):  29%|██▉       | 442/1526 [00:02<00:06, 178.14it/s]



Average Metric: 131.67 / 444 (29.7%):  29%|██▉       | 443/1526 [00:02<00:06, 178.14it/s]



Average Metric: 132.17 / 445 (29.7%):  29%|██▉       | 444/1526 [00:03<00:06, 179.52it/s]



Average Metric: 132.44 / 446 (29.7%):  29%|██▉       | 445/1526 [00:03<00:06, 179.52it/s]



Average Metric: 132.99 / 447 (29.8%):  29%|██▉       | 446/1526 [00:03<00:06, 179.52it/s]



Average Metric: 133.49 / 449 (29.7%):  29%|██▉       | 448/1526 [00:03<00:06, 179.52it/s]



Average Metric: 133.89 / 450 (29.8%):  29%|██▉       | 449/1526 [00:03<00:05, 179.52it/s]



Average Metric: 134.57 / 451 (29.8%):  29%|██▉       | 450/1526 [00:03<00:05, 179.52it/s]



Average Metric: 135.59 / 453 (29.9%):  30%|██▉       | 452/1526 [00:03<00:05, 179.52it/s]



Average Metric: 135.72 / 454 (29.9%):  30%|██▉       | 453/1526 [00:03<00:05, 179.52it/s]



Average Metric: 136.17 / 455 (29.9%):  30%|██▉       | 454/1526 [00:03<00:05, 179.52it/s]



Average Metric: 136.50 / 456 (29.9%):  30%|██▉       | 455/1526 [00:03<00:05, 179.52it/s]



Average Metric: 136.79 / 457 (29.9%):  30%|██▉       | 456/1526 [00:03<00:05, 179.52it/s]



Average Metric: 137.07 / 458 (29.9%):  30%|██▉       | 457/1526 [00:03<00:05, 179.52it/s]



Average Metric: 137.69 / 460 (29.9%):  30%|███       | 459/1526 [00:03<00:05, 179.52it/s]



Average Metric: 138.13 / 461 (30.0%):  30%|███       | 460/1526 [00:03<00:05, 179.52it/s]



Average Metric: 138.97 / 463 (30.0%):  30%|███       | 462/1526 [00:03<00:05, 179.52it/s]



Average Metric: 138.97 / 464 (29.9%):  30%|███       | 463/1526 [00:03<00:05, 179.52it/s]



Average Metric: 139.22 / 465 (29.9%):  30%|███       | 464/1526 [00:03<00:05, 181.43it/s]



Average Metric: 139.22 / 466 (29.9%):  30%|███       | 465/1526 [00:03<00:05, 181.43it/s]



Average Metric: 139.22 / 467 (29.8%):  31%|███       | 466/1526 [00:03<00:05, 181.43it/s]



Average Metric: 139.40 / 469 (29.7%):  31%|███       | 468/1526 [00:03<00:05, 181.43it/s]



Average Metric: 140.00 / 470 (29.8%):  31%|███       | 469/1526 [00:03<00:05, 181.43it/s]



Average Metric: 140.29 / 472 (29.7%):  31%|███       | 471/1526 [00:03<00:05, 181.43it/s]



Average Metric: 140.90 / 473 (29.8%):  31%|███       | 472/1526 [00:03<00:05, 181.43it/s]



Average Metric: 141.07 / 474 (29.8%):  31%|███       | 473/1526 [00:03<00:05, 181.43it/s]



Average Metric: 141.07 / 475 (29.7%):  31%|███       | 474/1526 [00:03<00:05, 181.43it/s]



Average Metric: 141.07 / 476 (29.6%):  31%|███       | 475/1526 [00:03<00:05, 181.43it/s]



Average Metric: 141.07 / 477 (29.6%):  31%|███       | 476/1526 [00:03<00:05, 181.43it/s]



Average Metric: 141.47 / 478 (29.6%):  31%|███▏      | 477/1526 [00:03<00:05, 181.43it/s]



Average Metric: 141.75 / 479 (29.6%):  31%|███▏      | 478/1526 [00:03<00:05, 181.43it/s]



Average Metric: 142.25 / 480 (29.6%):  31%|███▏      | 479/1526 [00:03<00:05, 181.43it/s]



Average Metric: 142.80 / 482 (29.6%):  32%|███▏      | 481/1526 [00:03<00:05, 181.43it/s]



Average Metric: 143.75 / 484 (29.7%):  32%|███▏      | 483/1526 [00:03<00:05, 181.43it/s]



Average Metric: 143.75 / 485 (29.6%):  32%|███▏      | 484/1526 [00:03<00:05, 181.43it/s]



Average Metric: 144.32 / 487 (29.6%):  32%|███▏      | 486/1526 [00:03<00:05, 181.43it/s]



Average Metric: 144.32 / 487 (29.6%):  32%|███▏      | 487/1526 [00:03<00:05, 194.52it/s]



Average Metric: 145.55 / 489 (29.8%):  32%|███▏      | 488/1526 [00:03<00:05, 194.52it/s]



Average Metric: 145.55 / 490 (29.7%):  32%|███▏      | 489/1526 [00:03<00:05, 194.52it/s]



Average Metric: 145.81 / 492 (29.6%):  32%|███▏      | 491/1526 [00:03<00:05, 194.52it/s]



Average Metric: 145.81 / 493 (29.6%):  32%|███▏      | 492/1526 [00:03<00:05, 194.52it/s]



Average Metric: 146.21 / 494 (29.6%):  32%|███▏      | 493/1526 [00:03<00:05, 194.52it/s]



Average Metric: 147.21 / 496 (29.7%):  32%|███▏      | 495/1526 [00:03<00:05, 194.52it/s]



Average Metric: 147.80 / 498 (29.7%):  33%|███▎      | 497/1526 [00:03<00:05, 194.52it/s]



Average Metric: 147.80 / 499 (29.6%):  33%|███▎      | 498/1526 [00:03<00:05, 194.52it/s]



Average Metric: 148.16 / 500 (29.6%):  33%|███▎      | 499/1526 [00:03<00:05, 194.52it/s]



Average Metric: 148.49 / 501 (29.6%):  33%|███▎      | 500/1526 [00:03<00:05, 194.52it/s]



Average Metric: 148.71 / 502 (29.6%):  33%|███▎      | 501/1526 [00:03<00:05, 194.52it/s]



Average Metric: 149.29 / 503 (29.7%):  33%|███▎      | 502/1526 [00:03<00:05, 194.52it/s]



Average Metric: 149.57 / 504 (29.7%):  33%|███▎      | 503/1526 [00:03<00:05, 194.52it/s]



Average Metric: 149.86 / 505 (29.7%):  33%|███▎      | 504/1526 [00:03<00:05, 194.52it/s]



Average Metric: 150.14 / 506 (29.7%):  33%|███▎      | 505/1526 [00:03<00:05, 194.52it/s]



Average Metric: 150.36 / 507 (29.7%):  33%|███▎      | 506/1526 [00:03<00:05, 194.52it/s]



Average Metric: 150.77 / 509 (29.6%):  33%|███▎      | 508/1526 [00:03<00:05, 194.52it/s]



Average Metric: 151.34 / 510 (29.7%):  33%|███▎      | 509/1526 [00:03<00:05, 194.52it/s]



Average Metric: 151.59 / 511 (29.7%):  33%|███▎      | 510/1526 [00:03<00:05, 194.52it/s]



Average Metric: 151.99 / 512 (29.7%):  33%|███▎      | 511/1526 [00:03<00:05, 194.52it/s]



Average Metric: 152.24 / 514 (29.6%):  34%|███▎      | 513/1526 [00:03<00:05, 194.52it/s]



Average Metric: 152.24 / 515 (29.6%):  34%|███▎      | 514/1526 [00:03<00:05, 194.52it/s]



Average Metric: 152.51 / 516 (29.6%):  34%|███▎      | 515/1526 [00:03<00:05, 194.52it/s]



Average Metric: 152.91 / 517 (29.6%):  34%|███▍      | 516/1526 [00:03<00:05, 194.52it/s]



Average Metric: 153.53 / 518 (29.6%):  34%|███▍      | 517/1526 [00:03<00:05, 194.52it/s]



Average Metric: 153.86 / 520 (29.6%):  34%|███▍      | 519/1526 [00:03<00:05, 194.52it/s]



Average Metric: 154.40 / 521 (29.6%):  34%|███▍      | 520/1526 [00:03<00:05, 194.52it/s]



Average Metric: 154.40 / 521 (29.6%):  34%|███▍      | 521/1526 [00:03<00:04, 234.48it/s]



Average Metric: 154.97 / 523 (29.6%):  34%|███▍      | 522/1526 [00:03<00:04, 234.48it/s]



Average Metric: 154.97 / 524 (29.6%):  34%|███▍      | 523/1526 [00:03<00:04, 234.48it/s]



Average Metric: 154.97 / 525 (29.5%):  34%|███▍      | 524/1526 [00:03<00:04, 234.48it/s]



Average Metric: 155.04 / 526 (29.5%):  34%|███▍      | 525/1526 [00:03<00:04, 234.48it/s]



Average Metric: 156.04 / 527 (29.6%):  34%|███▍      | 526/1526 [00:03<00:04, 234.48it/s]



Average Metric: 157.14 / 529 (29.7%):  35%|███▍      | 528/1526 [00:03<00:04, 234.48it/s]



Average Metric: 157.14 / 530 (29.6%):  35%|███▍      | 529/1526 [00:03<00:04, 234.48it/s]



Average Metric: 157.71 / 531 (29.7%):  35%|███▍      | 530/1526 [00:03<00:04, 234.48it/s]



Average Metric: 157.99 / 532 (29.7%):  35%|███▍      | 531/1526 [00:03<00:04, 234.48it/s]



Average Metric: 158.84 / 534 (29.7%):  35%|███▍      | 533/1526 [00:03<00:04, 234.48it/s]



Average Metric: 159.09 / 535 (29.7%):  35%|███▍      | 534/1526 [00:03<00:04, 234.48it/s]



Average Metric: 159.49 / 536 (29.8%):  35%|███▌      | 535/1526 [00:03<00:04, 234.48it/s]



Average Metric: 159.74 / 537 (29.7%):  35%|███▌      | 536/1526 [00:03<00:04, 234.48it/s]



Average Metric: 160.16 / 538 (29.8%):  35%|███▌      | 537/1526 [00:03<00:04, 234.48it/s]



Average Metric: 160.66 / 540 (29.8%):  35%|███▌      | 539/1526 [00:03<00:04, 234.48it/s]



Average Metric: 161.28 / 542 (29.8%):  35%|███▌      | 541/1526 [00:03<00:04, 234.48it/s]



Average Metric: 161.78 / 543 (29.8%):  36%|███▌      | 542/1526 [00:03<00:04, 234.48it/s]



Average Metric: 162.03 / 545 (29.7%):  36%|███▌      | 544/1526 [00:03<00:04, 234.48it/s]



Average Metric: 162.47 / 546 (29.8%):  36%|███▌      | 545/1526 [00:03<00:04, 234.48it/s]



Average Metric: 163.09 / 547 (29.8%):  36%|███▌      | 546/1526 [00:03<00:04, 234.48it/s]



Average Metric: 163.09 / 547 (29.8%):  36%|███▌      | 547/1526 [00:03<00:04, 240.56it/s]



Average Metric: 163.49 / 548 (29.8%):  36%|███▌      | 547/1526 [00:03<00:04, 240.56it/s]



Average Metric: 163.82 / 549 (29.8%):  36%|███▌      | 548/1526 [00:03<00:04, 240.56it/s]



Average Metric: 164.38 / 551 (29.8%):  36%|███▌      | 550/1526 [00:03<00:04, 240.56it/s]



Average Metric: 165.23 / 553 (29.9%):  36%|███▌      | 552/1526 [00:03<00:04, 240.56it/s]



Average Metric: 165.83 / 555 (29.9%):  36%|███▋      | 554/1526 [00:03<00:04, 240.56it/s]



Average Metric: 166.28 / 556 (29.9%):  36%|███▋      | 555/1526 [00:03<00:04, 240.56it/s]



Average Metric: 166.48 / 558 (29.8%):  37%|███▋      | 557/1526 [00:03<00:04, 240.56it/s]



Average Metric: 166.48 / 559 (29.8%):  37%|███▋      | 558/1526 [00:03<00:04, 240.56it/s]



Average Metric: 166.73 / 561 (29.7%):  37%|███▋      | 560/1526 [00:03<00:04, 240.56it/s]



Average Metric: 167.46 / 563 (29.7%):  37%|███▋      | 562/1526 [00:03<00:04, 240.56it/s]



Average Metric: 168.03 / 564 (29.8%):  37%|███▋      | 563/1526 [00:03<00:04, 240.56it/s]



Average Metric: 168.37 / 566 (29.7%):  37%|███▋      | 565/1526 [00:03<00:03, 240.56it/s]



Average Metric: 168.98 / 567 (29.8%):  37%|███▋      | 566/1526 [00:03<00:03, 240.56it/s]



Average Metric: 169.21 / 568 (29.8%):  37%|███▋      | 567/1526 [00:03<00:03, 240.56it/s]



Average Metric: 169.21 / 569 (29.7%):  37%|███▋      | 568/1526 [00:03<00:03, 240.56it/s]



Average Metric: 170.38 / 571 (29.8%):  37%|███▋      | 570/1526 [00:03<00:03, 240.56it/s]



Average Metric: 170.64 / 572 (29.8%):  37%|███▋      | 572/1526 [00:03<00:06, 136.85it/s]



Average Metric: 171.35 / 574 (29.9%):  38%|███▊      | 573/1526 [00:03<00:06, 136.85it/s]



Average Metric: 171.35 / 575 (29.8%):  38%|███▊      | 574/1526 [00:03<00:06, 136.85it/s]



Average Metric: 172.01 / 576 (29.9%):  38%|███▊      | 575/1526 [00:04<00:06, 136.85it/s]



Average Metric: 172.21 / 577 (29.8%):  38%|███▊      | 576/1526 [00:04<00:06, 136.85it/s]



Average Metric: 172.21 / 578 (29.8%):  38%|███▊      | 577/1526 [00:04<00:06, 136.85it/s]



Average Metric: 172.50 / 580 (29.7%):  38%|███▊      | 579/1526 [00:04<00:06, 136.85it/s]



Average Metric: 172.83 / 581 (29.7%):  38%|███▊      | 580/1526 [00:04<00:06, 136.85it/s]



Average Metric: 173.49 / 583 (29.8%):  38%|███▊      | 582/1526 [00:04<00:06, 136.85it/s]



Average Metric: 173.49 / 584 (29.7%):  38%|███▊      | 583/1526 [00:04<00:06, 136.85it/s]



Average Metric: 174.16 / 585 (29.8%):  38%|███▊      | 584/1526 [00:04<00:06, 136.85it/s]



Average Metric: 174.16 / 586 (29.7%):  38%|███▊      | 585/1526 [00:04<00:06, 136.85it/s]



Average Metric: 174.85 / 588 (29.7%):  38%|███▊      | 587/1526 [00:04<00:06, 136.85it/s]



Average Metric: 175.79 / 589 (29.8%):  39%|███▊      | 588/1526 [00:04<00:06, 136.85it/s]



Average Metric: 176.10 / 590 (29.8%):  39%|███▊      | 589/1526 [00:04<00:06, 136.85it/s]



Average Metric: 176.85 / 592 (29.9%):  39%|███▊      | 591/1526 [00:04<00:06, 136.85it/s]



Average Metric: 176.85 / 592 (29.9%):  39%|███▉      | 592/1526 [00:04<00:09, 100.91it/s]



Average Metric: 176.85 / 593 (29.8%):  39%|███▉      | 592/1526 [00:04<00:09, 100.91it/s]



Average Metric: 177.01 / 594 (29.8%):  39%|███▉      | 593/1526 [00:04<00:09, 100.91it/s]



Average Metric: 177.01 / 595 (29.7%):  39%|███▉      | 594/1526 [00:04<00:09, 100.91it/s]



Average Metric: 177.23 / 597 (29.7%):  39%|███▉      | 596/1526 [00:04<00:09, 100.91it/s]



Average Metric: 177.41 / 598 (29.7%):  39%|███▉      | 597/1526 [00:04<00:09, 100.91it/s]



Average Metric: 178.12 / 600 (29.7%):  39%|███▉      | 599/1526 [00:04<00:09, 100.91it/s]



Average Metric: 178.52 / 601 (29.7%):  39%|███▉      | 600/1526 [00:04<00:09, 100.91it/s]



Average Metric: 179.19 / 602 (29.8%):  39%|███▉      | 601/1526 [00:04<00:09, 100.91it/s]



Average Metric: 179.52 / 603 (29.8%):  39%|███▉      | 602/1526 [00:04<00:09, 100.91it/s]



Average Metric: 180.24 / 605 (29.8%):  40%|███▉      | 604/1526 [00:04<00:09, 100.91it/s]



Average Metric: 180.70 / 606 (29.8%):  40%|███▉      | 605/1526 [00:04<00:09, 100.91it/s]



Average Metric: 457.89 / 1526 (30.0%): 100%|██████████| 1526/1526 [00:05<00:00, 264.10it/s] 

2025/08/24 14:35:35 INFO dspy.evaluate.evaluate: Average Metric: 457.885987342008 / 1526 (30.0%)





Unnamed: 0,context,question,community,example_conversation_history,example_response,pred_response,student_goal,pragmatic_need,pred_conversation_history,reasoning,SemanticF1
0,Freddy Krueger General information Age ? (at the time of physical ...,who is freddy krueger?,A Nightmare on Elm Street,No prior conversation context,Freddy Kruger is the nightmare in nighmare on Elm street. Please n...,Freddy Krueger is the main antagonist in the A Nightmare on Elm St...,"To understand the character of Freddy Krueger, including his origi...",To provide a factual and engaging summary of Freddy Krueger's iden...,No prior conversation context,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.182]
1,Spencer Lewis General information Age 16 (At the time of his death...,"oh man, that sucks.",A Nightmare on Elm Street,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,"Yes and no, it means I can be lighting quick, especially since I t...",I'm sorry to hear you say that—it sounds like the dark and tragic ...,To learn more detailed and accurate information about Freddy Krueg...,The user needs assistance in overcoming the access issue to Wikipe...,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,
2,"Renny Harlin Information Date of birth March 15, 1959 (Age 53) Fil...",haha that is right.. more hourly rules!,A Nightmare on Elm Street,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,"So, you know the drill with Freddie Right? He's played by Robert E...","Haha, you got it! The more we dive into Freddy's world and keep th...","To engage in a fun, casual conversation about ""Nightmare on Elm St...",To respond in a light-hearted manner that sustains the conversatio...,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.179]
3,"Renny Harlin Information Date of birth March 15, 1959 (Age 53) Fil...",haha i know,A Nightmare on Elm Street,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,"You don't want to fall asleep, because that's when he can get you.","Haha, yeah, totally! Freddy's got that classic scare factor—Robert...","To engage in casual and enjoyable discussions about horror films, ...","To receive light-hearted, immediate responses that build on the co...",Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,
4,""" Dream Warriors "" is a song by heavy metal band, Dokken , written...",i know.. I will have to skip the ambien tonight,A Nightmare on Elm Street,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,"Well, to be clear, he isn't real. Robert England is actually a fai...","Hey, that's a smart move—skipping the Ambien tonight sounds like a...","To safely explore and discuss horror movie lore, particularly Fred...","To receive immediate, practical advice on staying awake and managi...",Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.179]
5,Johnny Depp Information Date of birth TBA Film(s) A Nightmare on E...,oh yeah? Which shows or movies?,A Nightmare on Elm Street,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,"Well, the last thing I saw him in, a week or so ago was Stranger t...","Great question! Robert Englund, the actor best known for playing F...",To learn about the additional television shows or movies that Robe...,The user needs a clear and concise list or examples of Robert Engl...,Q: who is freddy krueger? A: Freddy Kruger is the nightmare in nig...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,
6,Johnny Depp Information Date of birth TBA Film(s) A Nightmare on E...,who was the star on this movie?,A Nightmare on Elm Street,No prior conversation context,"Robert Englund IS Freddy Kruger, the bad guy for these films. Note...","In the movie ""A Nightmare on Elm Street,"" Johnny Depp starred as G...",To identify and understand the main actors in classic horror films...,To provide a direct and accurate identification of the movie's sta...,No prior conversation context,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,
7,"Kelly Rowland Information Date of birth February 11, 1981 (age 31)...",great.. that sounds like you are very devoted.,A Nightmare on Elm Street,Q: who was the star on this movie? A: Robert Englund IS Freddy Kru...,"I try very hard, though my efforts sometimes come to nothing, as o...","Thanks, I'm glad you're excited—I'm here to dive deeper into the A...",To explore and understand the key elements of the A Nightmare on E...,To ensure quick and reliable responses to user queries about the t...,Q: who was the star on this movie? A: Robert Englund IS Freddy Kru...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,✔️ [0.250]
8,This is about the Loretta Krueger of the A Nightmare on Elm Street...,yes I would love to know more about it.,A Nightmare on Elm Street,Q: who was the star on this movie? A: Robert Englund IS Freddy Kru...,Well our horror masterpiece was written by Wesley Strick and Eric ...,That's great to hear you're excited to dive deeper into the A Nigh...,"To learn more about the A Nightmare on Elm Street franchise, inclu...",To clarify what specific aspects of the A Nightmare on Elm Street ...,Q: who was the star on this movie? A: Robert Englund IS Freddy Kru...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,
9,Johnny Depp Information Date of birth TBA Film(s) A Nightmare on E...,awesome.. which of those actors are your favorite?,A Nightmare on Elm Street,Q: who was the star on this movie? A: Robert Englund IS Freddy Kru...,"Patricia Rooney Mara is probably a fave of mine, She's been in som...","That's a great question! As an AI, I don't have personal favorites...",To explore and learn about actors in the Nightmare on Elm Street s...,"To provide a neutral, informative response that highlights key fac...",Q: who was the star on this movie? A: Robert Englund IS Freddy Kru...,Turn 1: Intermediary reasoning generated | Turn 2: Final answer re...,


In [34]:
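# `score` is the aggregate returned by the evaluation run (a percentage, e.g. 30.01).
print(f"\n✅ LLM_RAG SemanticF1 Score: {score:.4f}")
print(f"Total examples evaluated: {len(eval_examples)}")

# Show the first few (example, prediction, score) tuples as a sanity check.
print("Sample outputs:")
for i, output in enumerate(outputs[:3]):
    print(f"Example {i+1}:")
    print(output)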


✅ LLM_RAG SemanticF1 Score: 30.0100
Total examples evaluated: 1526
Sample outputs:
Example 1:
(Example({'context': '\n\n\n\n\n\n\n     Freddy Krueger\n    \n\n\n      General information\n     \n\n\n       Age\n      \n\n       ? (at the time of physical death)\n      \n\n\n\n       Relationships\n      \n\n       unknown\n      \n\n\n\n       Occupation\n      \n\n       Groundskeeper (when he was alive)\n       \n        Child Molester (when he was alive)\n        Serial Killer (confirmed after death, not stated either way when he was alive)\n       \n\n\n\n\n       Appearances\n      \n\n\n        2010 reboot\n       \n\n\n\n\n       Portrayal\n      \n\n\n        Jackie Earle Haley\n       \n\n\n\n\n\n\n   This version of\n   \n    Freddy Krueger\n   \n   is the main antagonist of the\n   \n\n     A Nightmare on Elm Street\n    \n\n\n    Reboot\n   \n   . In this film, Freddy appeared once again as a Dream Demon and a serial killer after his death.\n  \n\n   In life, he was a chil

### **Evaluating using the SemanticF1 method**
#### *NOTE: `dspy.Evaluate` reports only the average score and does not provide precision and recall separately, so I obtained them using the SemanticF1 method. Because I had already run `dspy.Evaluate`, the results were cached and this step was nearly instant.*

In [35]:
predictions = []
gt_list = []
# Each entry in `outputs` is a (ground-truth example, prediction, score) tuple.
for gt_example, pred_example, _ in outputs:
    print(f"GROUND TRUTH : {gt_example['response']}")
    print(f"PREDICTION : {pred_example['response']}")
    print("\n")
    predictions.append(pred_example)
    gt_list.append(gt_example)

results = evaluate_results(gt_list, predictions)

GROUND TRUTH : Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.
PREDICTION : Freddy Krueger is the main antagonist in the A Nightmare on Elm Street horror franchise, particularly highlighted in the 2010 reboot. Originally, he was a human groundskeeper at a preschool who was secretly a child molester, preying on young children in his community. After being discovered and killed by the parents of his victims in a vigilante act, Freddy transformed into a vengeful supernatural entity known as a "Dream Demon." This allows him to invade people's dreams and kill them in ways that manifest real-world injuries, seeking revenge on the teenagers he blames for his demise.

In the series, Freddy is significant as a horror trope representing th

Evaluating examples: 100%|██████████| 1526/1526 [00:00<00:00, 2029.83it/s]


In [36]:
import pandas as pd

df = pd.DataFrame([results])
display(df.style.hide(axis="index"))

precision,recall,f1,count
0.300056,0.300056,0.300056,1526
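
The F1 value in the table matches the average that `dspy.Evaluate` logged (457.885987342008 / 1526). The snippet below checks that arithmetic and then sketches how per-example precision and recall could be macro-averaged if they were available separately. The helpers `f1_from` and `macro_average` and the sample pairs are hypothetical illustrations; they are not the notebook's `evaluate_results` and not a DSPy API.

```python
from statistics import mean

# Values copied from the dspy.Evaluate log line above.
total_score = 457.885987342008
num_examples = 1526
average_f1 = total_score / num_examples
print(f"{average_f1:.6f}")  # 0.300056


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(pairs):
    """Average precision, recall, and per-example F1 over (precision, recall) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return {
        "precision": mean(p for p, _ in pairs),
        "recall": mean(r for _, r in pairs),
        "f1": mean(f1_from(p, r) for p, r in pairs),
        "count": len(pairs),
    }


# Made-up (precision, recall) pairs, for illustration only.
sample_pairs = [(0.5, 0.25), (1.0, 1.0), (0.0, 0.0)]
print(macro_average(sample_pairs))
```

With separate per-example values, precision and recall would generally differ from each other and from F1, so three identical columns suggest the same averaged score is being reported in each.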


### **Saving the results to a json file**

In [38]:
outputs_dict = save_outputs(outputs)

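import json

# json.dump with indent=2 writes a single JSON document (not JSON Lines),
# so the file uses a .json extension to match its contents.
output_path = "llm_rag_outputs_with_history.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(outputs_dict, f, indent=2, ensure_ascii=False)

print(f"✅ Saved outputs to {output_path}")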

✅ Saved outputs to llm_rag_outputs_with_history.json
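
`json.dump` with `indent=2` produces one JSON document rather than JSON Lines. If line-delimited output is preferred (the format that `read_data` parses at the top of the notebook), one record per line could be written roughly as follows. This is a minimal sketch: `write_jsonl`, `read_jsonl`, and the sample records are hypothetical, and it assumes the saved outputs can be represented as a list of dictionaries.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records standing in for the real outputs.
sample_records = [
    {"question": "who is freddy krueger?", "score": 0.5},
    {"question": "oh yeah? Which shows or movies?", "score": 0.25},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_outputs.jsonl")
    write_jsonl(sample_records, sample_path)
    round_trip = read_jsonl(sample_path)

print(round_trip == sample_records)
```

Avoiding `indent` here matters: pretty-printed objects span several lines, which a line-by-line reader cannot parse.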


# **5. Discussion Questions**

### **Comparing TraditionalQA with LLM RAG-based**
##### 1.    *The LLM-based RAG approach provided more pragmatic answers than the DistilBERT-with-retriever model in all of the answers.*

##### 2.    *From my personal observation of the answers, the LLM-based RAG approach provided more complete answers, with more pragmatic information, than the original ground-truth answers.*

##### 3.    *Also from my analysis of the answers, the LLM-based RAG approach provided more complete answers, both literally and pragmatically, than DistilBERT with literal/pragmatic spans as context.*

##### 4. *The LLM-based RAG approach answers several possible follow-up questions, but they may not be the exact follow-up questions that the ground-truth pragmatic answers aim to address. They are still valid follow-up questions that could be asked in the conversation.*

NOTE: I repeated this paragraph because I think this is the same question as 4.4.1 (comparing results with the DistilBERT model on the first questions).

### **What are the strengths and weaknesses of each approach for this pragmatic QA task ?**
##### *Strengths of LLM RAG-based approach*:
- ***Better at understanding and generating pragmatic answers.***
- ***Can incorporate broader context from retrieved documents to enhance answers.***
- ***More flexible in handling diverse question types and conversational nuances.***
##### *Weaknesses of LLM RAG-based approach*:
- ***Sometimes over-contextualizes or adds irrelevant information that the reader may not have wanted.***
- ***Heavier computational requirements and latency due to large model size.***
- ***Potentially less straightforward than traditional models.***
##### *Strengths of Traditional QA approach (DistilBERT with retriever)*:
- ***More efficient and faster inference due to smaller model size.***
##### *Weaknesses of Traditional QA approach (DistilBERT with retriever)*:
- ***Struggles with pragmatic reasoning and understanding implicit intent.***
- ***Limited by its ability to extract information from the documents.***
- ***Less adaptable to conversational contexts and follow-up questions.***

##### *Overall, the LLM RAG-based approach is better suited for pragmatic QA tasks due to its ability to understand context and generate nuanced answers, while the Extractive QA approach is more limited in this regard but may be more efficient for straightforward extraction tasks.*

### **Do you observe a difference between the first question in a conversation and later questions?**
##### *Yes, there is a noticeable difference between the first question in a conversation and later questions. The first question often sets the context and tone for the conversation, and it may be more general or introductory. Later questions tend to be more specific and build upon the information provided in earlier exchanges. They may also require a deeper understanding of the context established by the initial question and subsequent answers. Additionally, later questions might involve more follow-up questions that seek clarification or additional details based on the ongoing dialogue.*
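
One way to check this observation quantitatively is to average a per-question score separately for first turns and later turns. The sketch below assumes a flat list of per-question scores in the same order as the conversations; `mean_score_by_turn` and the toy values are hypothetical and do not come from the experiments above.

```python
from collections import defaultdict

def mean_score_by_turn(conversations, scores):
    """Average per-question scores separately for first and later turns.

    conversations: list of dicts, each with a 'qas' list (as in the dataset).
    scores: flat list of per-question scores, in conversation order.
    """
    buckets = defaultdict(list)
    flat_index = 0
    for conv in conversations:
        for turn in range(len(conv["qas"])):
            key = "first" if turn == 0 else "later"
            buckets[key].append(scores[flat_index])
            flat_index += 1
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

# Made-up data: two conversations with 3 and 2 questions
toy_conversations = [{"qas": [{}, {}, {}]}, {"qas": [{}, {}]}]
toy_scores = [0.5, 0.25, 0.25, 1.0, 0.5]
result = mean_score_by_turn(toy_conversations, toy_scores)
print(result)
```

A gap between the two averages on the real scores would support the qualitative claim; a similar value for both would suggest the difference lies in the style of the answers rather than in the metric.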

### ***Theory of Mind: To what extent do you think your LLM-based model exhibits a "Theory of Mind"? Does it truly "understand" the speaker's intent, or is it performing sophisticated pattern matching? Justify your answer with examples from your experiments.***

In [41]:
# Load the results from the JSON file (llm_rag_outputs_with_history.jsonl)
with open("llm_rag_outputs_with_history.jsonl", "r", encoding="utf-8") as f:
    loaded_results = json.load(f)

# Display the results
start = 20  # index of the conversation to display

# `loaded_results` is flat (one entry per question), so skip the
# questions of all conversations before `start`
i = 0
for conv in pcqa_val[:start]:
    i += len(conv['qas'])

for j, conv in enumerate(pcqa_val[start:start + 1]):  # Display a single conversation for brevity
    print(f"\nConversation {start + j + 1}: Community - {conv['community']}")
    for qas in conv['qas']:
        print(f"Q: {qas['q']}")
        print(f"Ground Truth Answer: {qas['a']}")
        print(f"LLM_RAG Predicted Answer: {loaded_results[i]['response']}")
        print(f"Conversation History: {loaded_results[i]['conversation_history']}")
        print(f"Student Goal: {loaded_results[i]['student_goal']}")
        print(f"Pragmatic Need: {loaded_results[i]['pragmatic_need']}")
        print("= " * 20)
        i += 1
    print("=" * 40)


Conversation 21: Community - Batman
Q: When was the original batman released?
Ground Truth Answer: The first appearance of Batman was in May 1939 in Detective Comics #27 title The Case of the Chemical Syndicate. In the comic, Commisioner Gordon learns that a chemical industrialist was murdered. Bruce Wayne, a young friend of Gordon, is present at the crime scene and decides to investigate as Batman.
LLM_RAG Predicted Answer: The original Batman was released in 1943 as a 15-chapter film serial produced by Columbia Pictures, starring Lewis Wilson as Batman and depicting him as a U.S. government spy during World War II.
Conversation History: No prior conversation context
Student Goal: The student's goal is to acquire accurate historical knowledge about the Batman franchise, specifically the timeline of its media releases, to build a foundational understanding of its origins in film and serials.
Pragmatic Need: The pragmatic need is to obtain a quick, factual answer for potential uses suc
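
The cell above finds the position of a conversation in the flat results list by summing the number of questions in all earlier conversations. As a small sketch of the same bookkeeping, the offsets of every conversation can be computed once with `itertools.accumulate`; `conversation_offsets` is a hypothetical helper, and the toy conversations are made up.

```python
from itertools import accumulate

def conversation_offsets(conversations):
    """Return, for each conversation, the flat index of its first question."""
    lengths = [len(conv["qas"]) for conv in conversations]
    # Running totals starting at 0; the last total is the overall count, so drop it
    return list(accumulate(lengths, initial=0))[:-1]

# Made-up conversations with 3, 2 and 4 questions
toy_conversations = [{"qas": [0] * 3}, {"qas": [0] * 2}, {"qas": [0] * 4}]
offsets = conversation_offsets(toy_conversations)
print(offsets)
```

With such a list, the results of conversation `k` are the slice starting at `offsets[k]` with length `len(conversations[k]["qas"])`.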

##### *The LLM-based model does not have a true Theory of Mind: it doesn't have beliefs, intentions, or awareness of what others are thinking. Instead, it uses advanced pattern matching on huge amounts of text to **simulate** understanding.*

##### *The model can appear to grasp the speaker's intent, often by elaborating or giving advice. However, this is because it has learned how people typically respond in similar situations, not because it actually understands the speaker's thoughts or goals.*

##### *The model gives the **illusion** of Theory of Mind, but it's really statistical pattern matching, not genuine understanding.*

---

#### **Example from Experiments**

**Conversation 21: Community - Batman**  
- **Question:** When was the original Batman released?  
- **Ground Truth Answer:** The first appearance of Batman was in May 1939 in Detective Comics #27.  
- **LLM_RAG Predicted Answer:** The original Batman was released in 1943 as a 15-chapter film serial produced by Columbia Pictures.

##### *Here, the model missed the intent. The user was asking about the comic book debut, not the movie. A human with Theory of Mind would realize "original Batman" likely means the character's first comic appearance, not the film. The LLM, however, matched the question to a common pattern about movie releases.*

#### **If the model truly understood intent, it would have reasoned:**
    "The user asked about 'original Batman' without mentioning films, so they probably mean the comic debut."

**Instead, it gave a film answer because that's a pattern it has seen before.**

---

**Conclusion:**  
The LLM can mimic Theory of Mind by picking up on patterns and context, but it doesn't truly understand intent. Its responses are based on statistical associations, not real comprehension of the speaker's mental state.

# **Optional: Experiment with ColBERTv2 as retriever**

I tried using ColBERTv2 as a retriever, but creating the index was very slow: it ran for more than an hour and a half without finishing, so I had to stop it.
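
Since index creation was the bottleneck, one option is to build the per-community retrievers smallest corpus first and stop starting new builds once a time budget is spent. The sketch below is generic Python and makes no assumption about ColBERT itself: `build_with_budget` is a hypothetical helper, and `build_one` stands for whatever function builds a single retriever. The budget is only checked between builds, so a build that is already running is not interrupted.

```python
import time

def build_with_budget(corpuses_mapping, build_one, budget_seconds):
    """Build retrievers smallest-corpus first until the time budget is spent.

    build_one(community, corpus) is a caller-supplied builder (hypothetical here).
    Returns the retrievers that were built and the communities that were skipped.
    """
    retrievers = {}
    skipped = []
    start = time.monotonic()
    by_size = sorted(corpuses_mapping.items(), key=lambda item: len(item[1]))
    for community, corpus in by_size:
        if not corpus:
            continue  # nothing to index for this community
        if time.monotonic() - start > budget_seconds:
            skipped.append(community)
            continue
        retrievers[community] = build_one(community, corpus)
    return retrievers, skipped

# Toy usage with a fake builder that only records the corpus size
toy_mapping = {"a": ["p1", "p2"], "b": ["p1"], "c": []}
built, skipped = build_with_budget(toy_mapping, lambda name, passages: len(passages), 60)
print(built, skipped)
```

Starting with the smallest corpora gives at least a few usable retrievers for a partial comparison even when the full indexing run cannot complete.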

In [None]:
# %uv pip install colbert-ai
# %uv pip install setuptools

In [None]:
from dspy.dsp.colbertv2 import ColBERTv2RetrieverLocal
from colbert.infra.config import ColBERTConfig

def create_colbert_retrievers(corpuses_mapping):
    """Create ColBERTv2 retrievers for each community in the corpus mapping"""
    retrievers = {}
    for community, corpus in corpuses_mapping.items():
        # Skip communities with no passages to index
        if len(corpus) > 0:
            colbert_config = ColBERTConfig(
                checkpoint="colbert-ir/colbertv2.0",
                index_name=f"{community}_colbertv2_index",
                nbits=2,  # Compression bits
                kmeans_niters=4,  # K-means iterations
                nranks=1,  # Number of ranks (for single GPU/CPU)
            )
            # Building the retriever also builds the index for this community
            retrievers[community] = ColBERTv2RetrieverLocal(
                passages=corpus,
                colbert_config=colbert_config
            )
    return retrievers

In [None]:

retrievers = create_colbert_retrievers(corpuses_mapping)

Building the index for experiment default with index name The Legend of Zelda_colbertv2_index


artifact.metadata: 0.00B [00:00, ?B/s]



[Aug 26, 14:59:04] #> Creating directory c:\Users\husse\Documents\courses\BGU\NLP with LLMs\hw3\nlp-with-llms-2025-hw3\experiments\default\indexes/The Legend of Zelda_colbertv2_index 




To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


#> Starting...
