## Function overview

- Step 0: Loading and chunking text
- Step 1: Filter contexts
    - With TextDescriptives
    - With an LLM call: is the chunk suitable as the basis for a question?
- Step 2: Generate questions
- Step 3: Filter generated questions
    - By text length
    - (Optional LLM call: is the answer clear and phrased in natural language?)
- Step 4: Relabel context-question pairs
    - Embed chunks, embed questions (Local Vector DB)
    - Use vector search to identify top k matches
    - If the "source" context of a question-context pair is not ranked first (top-1):
        - Use an LLM to check every context that scored higher than the "source" context
        - If a context passes the LLM check, update the question-context pair labels to include the additional context ID

Not done yet
- Step 5: Convert to BEIR format
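
Step 5 is not implemented in this notebook. As a rough sketch of what it could look like, assuming the usual BEIR layout (a `corpus.jsonl` with `_id`, `title`, `text`; a `queries.jsonl` with `_id`, `text`; and a tab-separated qrels file with `query-id`, `corpus-id`, `score`), the helper below serialises plain dicts shaped like the question, context and ID-pair mappings built in the later steps. The function name `to_beir` and the toy data are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def to_beir(
    questions: Dict[str, str],
    contexts: Dict[str, Tuple[str, str]],
    pairs: Dict[str, List[str]],
) -> Tuple[str, str, str]:
    """Return (corpus_jsonl, queries_jsonl, qrels_tsv) as strings.

    questions: question ID -> question text
    contexts:  context ID -> (title, text)
    pairs:     question ID -> list of relevant context IDs
    """
    corpus_lines = [
        json.dumps({"_id": cid, "title": title, "text": text}, ensure_ascii=False)
        for cid, (title, text) in contexts.items()
    ]
    query_lines = [
        json.dumps({"_id": qid, "text": text}, ensure_ascii=False)
        for qid, text in questions.items()
    ]
    # Every labelled context gets relevance score 1 (binary relevance is an assumption)
    qrels_lines = ["query-id\tcorpus-id\tscore"]
    for qid, cids in pairs.items():
        for cid in cids:
            qrels_lines.append(f"{qid}\t{cid}\t1")
    return "\n".join(corpus_lines), "\n".join(query_lines), "\n".join(qrels_lines)


# Toy data standing in for the real question/context mappings
corpus, queries, qrels = to_beir(
    questions={"q1": "Hvem skal regulere løbende erstatninger?"},
    contexts={"c1": ("Vejledning", "De private arbejdsskadeforsikringsselskaber ...")},
    pairs={"q1": ["c1"]},
)
print(qrels)
```

Writing the three strings to `corpus.jsonl`, `queries.jsonl` and `qrels/test.tsv` would then be a matter of three `open(...).write(...)` calls.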

### Setting up Env

In [215]:
import os
from dotenv import load_dotenv
import pandas as pd
from datasets import Dataset
from datasets import load_dataset

load_dotenv(override=True)

True

### Downloading dataset

In [216]:
# Load from hub
ds_vejledninger = load_dataset(
    "jealk/dk_retrieval_benchmark",
    "retsinformation",
    split="train",
    #download_mode="force_redownload",
)

In [217]:
# Create pandas dataframe from the dataset using the huggingface datasets library
df_vejledninger = ds_vejledninger.to_pandas()
df_vejledninger.head()

| Unnamed: 0 | url | title | html_content | text_content |
|---|---|---|---|---|
| 0 | https://www.retsinformation.dk/eli/retsinfo/20... | Vejledning om regulering af satser fra 1. janu... | `<div class="document-content" id="restylingRoo...` | Vejledning om regulering af satser fra 1. janu... |
| 1 | https://www.retsinformation.dk/eli/retsinfo/20... | Vejledning om satser i 2024 for betaling af ud... | `<div class="document-content" id="restylingRoo...` | Vejledning om satser i 2024 for betaling af ud... |
| 2 | https://www.retsinformation.dk/eli/retsinfo/20... | Vejledning om obligatorisk selvbooking af jobs... | `<div class="document-content" id="restylingRoo...` | Vejledning om obligatorisk selvbooking af jobs... |
| 3 | https://www.retsinformation.dk/eli/retsinfo/20... | Vejledning til bekendtgørelse om tilskud til s... | `<div class="document-content" id="restylingRoo...` | Vejledning til bekendtgørelse om tilskud til s... |
| 4 | https://www.retsinformation.dk/eli/retsinfo/20... | Vejledning om fleksløntilskud m.v. | `<div class="document-content" id="restylingRoo...` | Vejledning om fleksløntilskud m.v.\n1.Indledni... |


### Step 0: Chunking text data

In [218]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

def token_length_function(text_input):
  return len(tokenizer.encode(text_input, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap  = 0,
    length_function = token_length_function,
    separators = ["\n\n", "\n", ". ", "? ", "! "]
)

In [6]:
# LangChain's text splitter is slow compared to LlamaIndex: this call takes 2+ minutes on my CPU
split_documents = text_splitter.create_documents(
    list(df_vejledninger["text_content"]),
    metadatas=[{"title": title} for title in df_vejledninger["title"]],
)
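
To make the chunking step easier to reason about, here is a simplified, standard-library illustration of recursive splitting with a custom length function. It is not LangChain's implementation: it drops separators at chunk boundaries, ignores `chunk_overlap`, and leaves a piece oversized when no separator is left. The name `recursive_split` and the whitespace "tokenizer" are assumptions made for the example.

```python
from typing import Callable, List


def recursive_split(
    text: str,
    separators: List[str],
    chunk_size: int,
    length_fn: Callable[[str], int] = len,
) -> List[str]:
    """Split text on the first separator, merge pieces greedily up to
    chunk_size, and recurse with the remaining separators on pieces
    that are still too long."""
    if length_fn(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing left to split on: keep the oversized piece

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if length_fn(candidate) <= chunk_size:
            current = candidate
            continue
        # The candidate is too long: flush what has been collected so far
        if current:
            chunks.append(current)
            current = ""
        if length_fn(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, rest, chunk_size, length_fn))
    if current:
        chunks.append(current)
    return chunks


# Whitespace word count stands in for the tokenizer-based length function
word_count = lambda t: len(t.split())
example_chunks = recursive_split("a b c\n\nd e f g h\n\ni", ["\n\n", " "], 3, word_count)
print(example_chunks)
```

Measuring length in tokens rather than characters, as the notebook does, keeps every chunk within the embedding model's input limit.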

### Step 1: Filtering contexts

#### Filtering using TextDescriptives

In [7]:
import textdescriptives as td
import spacy
from typing import List, Dict, Optional
import os

# TODO: add optional metadata (list of dicts)
def filter_text_by_td(text_list: List[str], filter_type: bool=True) -> List[str]:
    """Filter texts by the textdescriptives quality check

    Args:
    text_list: A list of text strings
    filter_type: A boolean defining whether to keep texts that passed (True) or failed (False) the textdescriptives quality check

    Returns:
    A list of text chunks whose quality-check result equals filter_type
    """
    nlp = spacy.blank("da")
    nlp.add_pipe("sentencizer")
    quality_pipe = nlp.add_pipe("textdescriptives/quality")
    docs = list(nlp.pipe(text_list))
    filtered_texts = [doc.text for doc in docs if doc._.passed_quality_check==filter_type]
    
    return filtered_texts

In [219]:
from typing import List, Union
import spacy
import textdescriptives as td  # noqa: F401 (importing registers the "textdescriptives/quality" component)
from langchain_core.documents import Document

def filter_text_by_td(text_list: List[Union[str, Document]], filter_type: bool=True) -> List[Document]:
    """
    Filter documents by the textdescriptives quality check, converts strings to langchain Docs

    Args:
        text_list (List[Union[str, Document]]): A list of text strings or Document objects.
        filter_type (bool): A boolean defining whether to filter by texts that passed (True) or failed (False) the textdescriptives quality check.

    Returns:
        List[Document]: A list of Document objects that passed the textdescriptives quality check.
    """
    nlp = spacy.blank("da")  # Blank Danish pipeline; the sentencizer below adds sentence boundaries
    nlp.add_pipe("sentencizer")
    quality_pipe = nlp.add_pipe("textdescriptives/quality")

    # Process the texts with SpaCy, handling both strings and Document objects
    processed_docs = list(nlp.pipe(doc.page_content if isinstance(doc, Document) else doc for doc in text_list))

    # Filter based on the quality check, merge with existing metadata
    filtered_docs = [Document(page_content=doc.text, metadata=getattr(original_doc, 'metadata', {}))
                     for original_doc, doc in zip(text_list, processed_docs) if doc._.passed_quality_check == filter_type]

    return filtered_docs


In [222]:
docs_passed_td = filter_text_by_td(split_documents[0:300], filter_type=True)

In [8]:
#Sample 300 texts
texts_passed_td = filter_text_by_td([text.page_content for text in split_documents[0:300]])
docs_passed_td = [doc for doc in split_documents if doc.page_content in texts_passed_td]

#### Filtering using LLM

In [223]:
import json
import logging
from typing import Dict, Any
from tqdm import tqdm  # Import tqdm

from openai import OpenAI
client = OpenAI()

def q_eval_system_prompt():
    sys_prompt = """Din opgave er at evaluere et givet tekstuddrag for at bestemme, om det er egnet til at danne grundlag for et generelt spørgsmål, der er relevant for eksempelvis en eksamen eller en test. 
    For at vurdere dette, skal du fokusere på følgende tre nøglekriterier:

    1. Klarhed: Vurder, om teksten er formuleret klart og direkte, således at et spørgsmål til denne tekst, vil kunne besvares uden yderligere forklaringer. Teksten skal være læsbar og ikke usammenhængende i sin struktur.
    
    2. Konkret Information: Afgør, om uddraget indeholder specifikke, faktuelle informationer, der kan danne grundlag for et præcist og direkte spørgsmål. Teksten skal præsentere håndgribelige fakta eller data, som et spørgsmål kan baseres på.

    3. Kontekstuel Helhed: Bedøm, om teksten leverer tilstrækkelig kontekst for at et spørgsmål baseret på uddraget vil være meningsfuldt og forståeligt uden behov for yderligere information. Teksten skal være selvstændig og give en fuld forståelse af det emne, der behandles.

    Baseret på din evaluering:

    - Tildel scoren 1, hvis tekstuddraget opfylder alle tre kriterier, og der kan formuleres et naturligt, klart og kontekstuelt meningsfuldt spørgsmål baseret på teksten.

    - Tildel scoren 0, hvis tekstuddraget ikke opfylder et eller flere af de ovenstående kriterier, hvilket gør det uegnet til at danne grundlag for et generelt spørgsmål.
    """
    return sys_prompt

def q_eval_user_prompt(text: str) -> str:
    """Prepare the prompt for the API call."""
    
    qa_egnet_tmlp = """Du er en erfaren sagsbehandler. 
    Din Opgave:
    Vurder det følgende tekstuddrag og angiv, om det er egnet til at stille et generelt spørgsmål til.

    Uddrag:
    {chunk_text}
    
    Returner din vurdering i følgende JSON-format:

    {{
    "llm_score": [indsæt enten 0 eller 1 her]
    }}
    """
    return qa_egnet_tmlp.format(chunk_text=text)


def json_api_call(system_prompt: str, user_prompt: str, oai_model: str="gpt-3.5-turbo-0125") -> Dict[str, Any]:
    """Perform the API call to evaluate the text."""
    try:
        completion = client.chat.completions.create(
            model=oai_model,
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": system_prompt
                },
                {
                    "role": "user", 
                    "content": user_prompt
                },
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)
    except json.JSONDecodeError as e:
        logging.error(f'JSON parsing failed: {e}')
    except Exception as e:
        logging.error(f'API call failed: {e}')
    return {}


def filter_text_by_llm(text_list: List[Union[str, Document]]) -> List[Document]:
    """Filter text chunks by an LLM quality check

    Args:
        text_list (List[Union[str, Document]]): A list of text strings or Document objects.

    Returns:
        List[Document]: A list of Document objects that passed the LLM quality check.
    """
    texts_passed_llm = []
    system_prompt = q_eval_system_prompt()
    for text_item in tqdm(text_list, desc="Evaluating texts"):
        # Extract text content from Document objects or use string directly
        text_content = text_item.page_content if isinstance(text_item, Document) else text_item
        
        user_prompt = q_eval_user_prompt(text_content)
        response = json_api_call(system_prompt, user_prompt)
        if response:
            if response.get('llm_score') == 1:
                # Preserve original Document object or create a new one if the input was a string
                passed_text_doc = text_item if isinstance(text_item, Document) else Document(page_content=text_content)
                texts_passed_llm.append(passed_text_doc)
            else:
                continue
        else:
            logging.error(f'Failed to evaluate the following text due to an earlier error:\n{text_content}')

    return texts_passed_llm
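
The user prompt asks for `"llm_score": [indsæt enten 0 eller 1 her]`, so a model may plausibly answer with `1`, `"1"` or `[1]`, and the strict `== 1` comparison above would silently drop the last two. A small normalising helper, sketched here under that assumption (the name `parse_binary_score` is hypothetical), makes the check more forgiving:

```python
from typing import Any, Dict, Optional


def parse_binary_score(response: Dict[str, Any], key: str = "llm_score") -> Optional[int]:
    """Return 0 or 1 if the response holds a recognisable binary score, else None."""
    value = response.get(key)
    # Unwrap a single-element list such as [1]
    if isinstance(value, list) and len(value) == 1:
        value = value[0]
    # Booleans are ints in Python; treat them as unrecognised to stay explicit
    if isinstance(value, bool):
        return None
    if isinstance(value, str) and value.strip() in {"0", "1"}:
        return int(value.strip())
    if isinstance(value, int) and value in (0, 1):
        return value
    return None


print(parse_binary_score({"llm_score": [1]}))
print(parse_binary_score({"llm_score": "0"}))
print(parse_binary_score({}))
```

A `None` result can then be logged separately from a genuine score of 0, which helps when auditing how many chunks were rejected because of formatting rather than content.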

In [229]:
#Sample just 50 texts
docs_passed_llm = filter_text_by_llm(docs_passed_td[:50])

Evaluating texts: 100%|██████████| 50/50 [00:43<00:00,  1.15it/s]


### Step 2: Generating Questions

In [226]:
def generate_question_template(text: str, num_q: int=1) -> str:
    question_tmlp = """Nedenfor er et uddrag (kontekst) fra en længere tekst:
    ---------------------
    {context_str}
    ---------------------
    Givet ovenstående uddrag og ingen forudgående viden, er din opgave at generere præcis {num_questions_per_chunk} spørgsmål til teksten.
    En sætning skal kun indeholde 1 spørgsmål, og spørgsmålet skal være formuleret kort og præcist. 
    Svaret til spørgsmålet, skal kunne findes i ovenstående uddrag.
    Spørgsmålet skal indeholde specifik kontekst, således at spørgsmålet efterfølgende kan besvares entydigt og uden kendskab til uddraget. 
    Spørgsmålene skal stilles i et sprog som en borger uden juridisk ekspertise kan forstå.

    Eksempel på et spørgsmål der ikke har en specifik kontekst, og som fejlagtigt indeholder 2 spørgsmål i 1 sætning: 
    "Hvilket dokument har den nye vejledning erstattet, og hvornår blev den udsendt?" -Da det ikke angivet hvilket dokument der er tale om, og derfor er svaret til spørgsmålet ikke entyidgt, uden kendskab til uddraget. Sætningen indeholder desuden 2 spørgsmål i samme sætning. 

    Eksempel på et godt spørgsmål, som kan besvares entydigt uden kendskab til uddraget:
    "Hvilke to indbetalinger udgør det samlede medlemsbidrag til en a-kasse?" - Da det er klart hvad der spørges om, og der kun er 1 rigtigt svar i den givne lovtekst.
    """
    return question_tmlp.format(context_str=text, num_questions_per_chunk=num_q)

In [227]:
def question_api_call(user_prompt: str, oai_model: str="gpt-4-0125-preview") -> Dict[str, Any]:
    """Perform the API call that generates questions for a text chunk."""
    try:
        completion = client.chat.completions.create(
            model=oai_model,
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": 'Din opgave er at stille præcise spørgsmål til et givet tekstuddrag og returnere en JSON med en liste af spørgsmål i formatet {"Q": ["spørgsmål1", "spørgsmål2", ...]}.'
                },
                {
                    "role": "user", 
                    "content": user_prompt
                },
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)
    except json.JSONDecodeError as e:
        logging.error(f'JSON parsing failed: {e}')
    except Exception as e:
        logging.error(f'API call failed: {e}')
    return {'Q': []}  # Empty list, so callers iterating over 'Q' add no questions on failure

In [271]:
from typing import Dict, List, Tuple
import uuid
from tqdm import tqdm
from langchain_core.documents import Document 

class QuestionContextManager:
    """
    Manages a collection of questions and their associated context chunks as Document objects.
    Allows for adding questions with contexts and displaying a specified number of these question-context pairs.
    """

    def __init__(self):
        self.questions: Dict[str, Document] = {}
        self.contexts: Dict[str, Document] = {}
        self.question_context_id_pairs: Dict[str, List[str]] = {}

    def add_question_context(self, question: Document, context: Document):
        """
        Adds a question and its associated context (both as Document objects) to the manager.
        Generates unique IDs for both the question and the context, storing them and their association.

        Parameters:
        - question (Document): The Document object containing the question.
        - context (Document): The Document object containing the context.
        """
        unique_question_id = str(uuid.uuid4())
        unique_context_id = str(uuid.uuid4())
        self.questions[unique_question_id] = question
        self.contexts[unique_context_id] = context
        self.question_context_id_pairs[unique_question_id] = [unique_context_id]

    @property
    def question_context_pairs(self) -> List[Tuple[Document, List[Document]]]:
        """
        Returns a list of tuples, each containing a question Document and a list of its associated context Documents.
        """
        return [(self.questions[qid], [self.contexts[cid] for cid in self.question_context_id_pairs[qid]]) for qid in self.questions]

    def display_question_context_pairs(self, num_pairs: int = None):
        """
        Displays a specified number of question-context pairs. If no number is specified, all pairs are displayed.

        Parameters:
        - num_pairs (int, optional): The number of question-context pairs to display. If None, all pairs are displayed. Defaults to None.
        """
        displayed_pairs = 0
        for q_id, context_ids in self.question_context_id_pairs.items():
            if num_pairs is not None and displayed_pairs >= num_pairs:
                break

            question = self.questions[q_id]
            print(f"Question: {question.page_content}")
            for c_id in context_ids:
                context = self.contexts[c_id]
                print(f"\nContext: {context.page_content}")
            print("-" * 40)  # Separator for readability
            displayed_pairs += 1

    def filter_questions_by_length(self, min_length: int = 20, max_length: int = 150):
        """
        Filters out questions that do not fall within the specified minimum and maximum character length.
        Updates the object by removing questions and their associated contexts that do not meet the criteria.

        Parameters:
        - min_length (int): The minimum character length for questions to be kept. Default to 20.
        - max_length (int): The maximum character length for questions to be kept. Default to 150.
        """
        questions_to_remove = [q_id for q_id, question in self.questions.items()
                               if not (min_length <= len(question.page_content) <= max_length)]

        # Remove the questions and question_context pairs
        for q_id in questions_to_remove:
            del self.questions[q_id]
            del self.question_context_id_pairs[q_id]

        # Identify contexts that are no longer linked to any questions
        contexts_to_remove = {context_id for context_id in self.contexts
                              if all(context_id not in contexts for contexts in self.question_context_id_pairs.values())}

        # Remove these contexts
        for context_id in contexts_to_remove:
            del self.contexts[context_id]

        print(f"Removed {len(questions_to_remove)} questions.")
        
    def update_question_context_pairs(self, q_c_to_append: Dict[str, List[str]]):
        """
        Appends the question-context matches to the existing question_context_id_pairs,
        ensuring no duplicates are added.

        Parameters:
        - q_c_to_append (Dict[str, List[str]]): A dictionary with question IDs as keys and lists of context IDs to append as values.
        """
        for q_id, c_id_list in q_c_to_append.items():
            if q_id in self.question_context_id_pairs:
                # Create a set from the existing IDs for quick lookup
                existing_ids_set = set(self.question_context_id_pairs[q_id])
                # Filter out duplicates while preserving order
                filtered_c_id_list = [c_id for c_id in c_id_list if c_id not in existing_ids_set]
                # Extend the existing list with the filtered, non-duplicate IDs
                self.question_context_id_pairs[q_id].extend(filtered_c_id_list)
            else:
                # Directly assign the list if the q_id is not already present
                self.question_context_id_pairs[q_id] = c_id_list

    def __repr__(self):
        return f"<QuestionContextManager with {len(self.questions)} questions>"
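
One property of `add_question_context` worth keeping in mind: it draws a fresh `uuid4` for the context on every call, so a chunk that yields several questions (`num_questions > 1`) ends up stored under several context IDs. A possible alternative, sketched below with hypothetical names (`CONTEXT_NAMESPACE`, `context_id_for`, `add_pair`), is to derive the context ID deterministically from the chunk text with `uuid5`, so identical chunks share one ID:

```python
import uuid
from typing import Dict, List

# Hypothetical namespace; any fixed UUID works as long as it never changes
CONTEXT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "dk_retrieval_benchmark/contexts")


def context_id_for(text: str) -> str:
    """Deterministic ID: the same chunk text always maps to the same UUID."""
    return str(uuid.uuid5(CONTEXT_NAMESPACE, text))


contexts: Dict[str, str] = {}
pairs: Dict[str, List[str]] = {}


def add_pair(question_id: str, context_text: str) -> None:
    """Register a question against its context, reusing the context ID if seen before."""
    cid = context_id_for(context_text)
    contexts.setdefault(cid, context_text)
    pairs.setdefault(question_id, []).append(cid)


add_pair("q1", "Samme tekstuddrag")
add_pair("q2", "Samme tekstuddrag")
print(len(contexts), pairs["q1"] == pairs["q2"])
```

With shared IDs, the vector store in Step 4 would hold each chunk once, so a duplicate of the source chunk could not show up as a separate candidate.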

In [230]:
from typing import Union

def generate_questions(textContexts: List[Union[Document, str]], num_questions: int = 1, oai_model: str = "gpt-4-0125-preview", duplicate_metadata: bool = True) -> QuestionContextManager:
    """
    Generates questions from a list of context Documents and returns a QuestionContextManager
    containing the generated questions and their contexts.

    Parameters:
    - textContexts (List[Union[Document, str]]): A list of Document objects or strings to generate questions from.
    - num_questions (int): Number of questions to generate per context. Default is 1.
    - oai_model (str): The model to use for generating questions. Default is "gpt-4-0125-preview".
    - duplicate_metadata (bool): If True, duplicate the metadata from context to the generated questions.

    Returns:
    QuestionContextManager: An object containing the generated questions and their contexts.
    """
    result = QuestionContextManager()
    for context in tqdm(textContexts):
        #If input is simply a list of strings, convert to doc with empty metadata
        if isinstance(context, str):
            context = Document(page_content=context, metadata={})
            
        question_prompt = generate_question_template(context.page_content, num_questions)
        response = question_api_call(question_prompt, oai_model)  
        try:
            questions = response['Q']
            for question_text in questions:
                question_document = Document(page_content=question_text.strip(), metadata=context.metadata if duplicate_metadata else {})
                result.add_question_context(question_document, context)
        except KeyError as e:
            print(f'Error parsing json response: {e}')
    return result

In [231]:
#Generate questions for a sub-sample of the passed documents
qc_meta = generate_questions(docs_passed_llm[:10])

100%|██████████| 10/10 [00:26<00:00,  2.69s/it]


### Step 3: Question filtering

In [232]:
qc_meta.filter_questions_by_length(min_length=20, max_length=150) #default values
qc_meta.display_question_context_pairs(3)

Removed 0 questions.
Question: Hvem skal regulere løbende erstatninger tilkendt før 1. januar 2024?

Context: De private arbejdsskadeforsikringsselskaber samt de arbejdsgivere, der er fritaget for at afgive risikoen efter loven, skal selv regulere løbende erstatninger, som er tilkendt før 1. januar 2024. Ved løbende erstatninger tilkendt i 2024 vil det fremgå af Arbejdsmarkedets Erhvervssikrings afgørelse, hvilke beløb, der skal udbetales i 2024.
----------------------------------------
Question: Hvordan beregnes grundlønnen for løbende erstatninger ifølge Arbejdstilsynets vejledning fra den 5. januar 2024?

Context: Arbejdstilsynet, den 5. januar 2024
Sine Frederiksen
/ Helle Klostergaard Christensen
Bilag 1
Bilaget indeholder eksempler på beregninger af kapitalerstatninger, godtgørelsesbeløb og overgangsbeløb samt løbende erstatninger og godtgørelser, som tilskadekomne eller dennes efterladte har ret til efter lov om arbejdsskadesikring, lov om sikring mod følger af arbejdsskade, lov

### Step 4: Updating the question-context pairs

In [277]:
import chromadb
from chromadb.utils import embedding_functions

def initialize_chroma_collection(client: chromadb.Client, collection_name: str, embedding_model: str, similarity_metric: str="cosine") -> chromadb.Collection:
    """Initialize or reset a ChromaDB collection with a specified embedding model.

    Args:
        client (chromadb.Client): The ChromaDB client instance.
        collection_name (str): The name of the collection to create or reset.
        embedding_model (str): The embedding model to use for the collection.
        similarity_metric (str): Similarity metric used for calculating embedding distance

    Returns:
        chromadb.Collection: The created ChromaDB collection.
    """
    # Create a new collection with the specified embedding function
    db_collection = client.create_collection(
        name=collection_name,
        embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(embedding_model, normalize_embeddings=True),
        metadata={"hnsw:space": similarity_metric}
    )
    return db_collection

def add_documents_to_chroma(collection: chromadb.Collection, id_document_pairs: Dict[str, Union[str, Document]], document_prepend: str):
    """Add documents to a specified ChromaDB collection, carrying over Document metadata when present.

    Args:
        collection (chromadb.Collection): The ChromaDB collection to add documents to.
        id_document_pairs (Dict[str, Union[str, Document]]): A dict with IDs as keys and a Document or string as each value.
        document_prepend (str): String to prepend to each document before embedding (e.g. 'passage:' for E5 models).
    """

    #If values are Documents
    if isinstance(list(id_document_pairs.values())[0], Document):
        context_documents = list(id_document_pairs.values())
        context_texts = [f'{document_prepend} {doc.page_content}' for doc in context_documents]
        context_ids = list(id_document_pairs.keys())
        context_metadatas = [{"type": "context", **doc.metadata} for doc in context_documents]
    #If values are Strings
    else:
        context_texts = [f'{document_prepend} {doc}' for doc in id_document_pairs.values()]
        context_ids = list(id_document_pairs.keys())
        context_metadatas = [{"type": "context"} for _ in context_texts]
    
    collection.add(
        documents=context_texts,
        ids=context_ids,
        metadatas=context_metadatas
    )
    
# Example usage
chroma_client = chromadb.Client()
collection_name = "qc_collection"
embedding_model = 'intfloat/multilingual-e5-base'

# Delete any existing collection with this name so the cell can be re-run.
# get_collection raises if the collection is missing, so attempt the delete directly.
try:
    chroma_client.delete_collection(collection_name)
except Exception:
    pass  # Collection did not exist yet

# Initialize or reset the ChromaDB collection
db_collection = initialize_chroma_collection(chroma_client, collection_name, embedding_model)
add_documents_to_chroma(db_collection, qc_meta.contexts, document_prepend='passage:')
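
With `"hnsw:space": "cosine"`, the distances returned by a query should be cosine distances, i.e. 1 minus cosine similarity, so 0 means the two vectors point in the same direction and smaller is closer. A plain-Python sketch of that quantity (the helper name `cosine_distance` and the toy vectors are assumptions for illustration):

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

This is the scale on which the `dist_threshold` used in the next cell operates: a threshold of 0.05 admits contexts whose distance to the question differs from the source context's distance by at most 0.05.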

In [266]:
def filter_context_candidates(chroma_db_collection, question_context_object: QuestionContextManager, top_k: int = 5, question_prepend: str='query:', dist_threshold: float = 0, include_origin_context: bool = False) -> Dict[str, List[str]]:
    """
    Filters context candidates for each question based on similarity scores and optionally includes the original context.

    Parameters:
    - chroma_db_collection: The database collection to query for context candidates.
    - question_context_object: An object containing questions, contexts and question-context ID pairs.
    - top_k: The number of top results to consider from the query.
    - question_prepend: String to prepend to each question before embedding (e.g. 'query:' for E5 models).
    - dist_threshold: Maximum absolute difference in distance from the ground truth context for a lower-ranked context to be included.
    - include_origin_context: A boolean to indicate whether the original context should be included in the results.

    Returns:
    - A dictionary mapping each question ID to a list of filtered context candidate IDs.
    """
    
    query_filtered = {}
    
    question_texts = [f'{question_prepend} {doc.page_content}' for doc in question_context_object.questions.values()]

    batch_query_result = chroma_db_collection.query(
        query_texts=question_texts,
        where={"type": "context"},
        n_results=top_k
    )

    for idx, (q_id, q_document) in enumerate(question_context_object.questions.items()):
        query_id_list = batch_query_result['ids'][idx]
        query_distances_list = batch_query_result.get('distances', [])[idx]

        ground_truth_id = question_context_object.question_context_id_pairs[q_id][0]
        context_ids = []

        if ground_truth_id in query_id_list:
            gt_index = query_id_list.index(ground_truth_id)
            gt_distance = query_distances_list[gt_index]

            # Include higher-ranked items than the ground truth
            context_ids.extend(query_id_list[:gt_index])

            # Optionally include the ground truth
            if include_origin_context:
                context_ids.append(ground_truth_id)

            # Include lower-ranked items within the distance threshold
            for id_, distance in zip(query_id_list[gt_index + 1:], query_distances_list[gt_index + 1:]):
                if abs(distance - gt_distance) <= dist_threshold:
                    context_ids.append(id_)
        else:
            # If the ground truth is not among the results, return all top_k query results
            context_ids.extend(query_id_list)

        query_filtered[q_id] = context_ids

    return query_filtered
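
The per-question selection rule is easier to follow on a single ranked list. The sketch below restates the same logic without ChromaDB; `select_candidates` and the toy IDs and distances are hypothetical.

```python
from typing import List


def select_candidates(
    ranked_ids: List[str],
    distances: List[float],
    ground_truth_id: str,
    dist_threshold: float = 0.0,
    include_origin: bool = False,
) -> List[str]:
    """Pick the contexts that should be re-checked for one question."""
    if ground_truth_id not in ranked_ids:
        # Source context missing from the top-k: every result is a candidate
        return list(ranked_ids)

    gt_index = ranked_ids.index(ground_truth_id)
    gt_distance = distances[gt_index]

    # Everything ranked above the source context
    selected = list(ranked_ids[:gt_index])
    if include_origin:
        selected.append(ground_truth_id)

    # Lower-ranked contexts that are almost as close as the source context
    for cid, dist in zip(ranked_ids[gt_index + 1:], distances[gt_index + 1:]):
        if abs(dist - gt_distance) <= dist_threshold:
            selected.append(cid)
    return selected


ranked = ["c2", "c1", "c3", "c4"]
dists = [0.10, 0.12, 0.15, 0.30]
print(select_candidates(ranked, dists, "c1", dist_threshold=0.05))
print(select_candidates(ranked, dists, "c9", dist_threshold=0.05))
```

In the first call `c2` outranks the source context `c1` and `c3` lies within 0.05 of it, so both become candidates; `c4` is too far away. In the second call the source context is absent, so the whole top-k list is returned.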


In [267]:
context_candidates_id = filter_context_candidates(db_collection, qc_meta, dist_threshold=0.05, include_origin_context=False)
context_candidates_id

{'512f3d46-fe16-44e4-a2af-41e4cd863b68': [],
 '4639269a-aac5-4533-8ebd-72d67bc2d1bd': ['27c95174-557d-41e6-bca0-fcaba2ed3995'],
 '807d007e-4de1-4701-b412-89cbfdeaabf0': [],
 '411d0a43-e58e-4abe-bb23-642414a8d95b': ['b8355203-fda9-4d44-8d5d-e99b13df41a8',
  'cbf49bab-01e7-4747-9b07-a294d524de23'],
 '1b9a4c43-31c2-4605-8892-a2a886d82aa5': ['954076d4-15f8-4d59-b523-71c75a9801e2',
  'cbf49bab-01e7-4747-9b07-a294d524de23'],
 '132b4306-cd81-47f1-8674-a1644d3d544a': [],
 '2b630fbe-830d-493e-8ea1-10d214493034': ['b8355203-fda9-4d44-8d5d-e99b13df41a8',
  '954076d4-15f8-4d59-b523-71c75a9801e2',
  '3ee5a517-271b-4a28-83f9-33c52fdb6bb0'],
 '4283f5e1-29c5-42c5-bc58-bd7ce1381216': ['b8355203-fda9-4d44-8d5d-e99b13df41a8',
  '3ee5a517-271b-4a28-83f9-33c52fdb6bb0'],
 'e8eb6c03-2ad7-4bde-b2f7-a99a67bb46b4': [],
 'be587ec5-f06f-4a77-b9c7-da0bcc507763': ['3ee5a517-271b-4a28-83f9-33c52fdb6bb0',
  'da55bea0-ae38-492e-bf60-a9f7e81269bf',
  '9640f0aa-b801-40b6-9ccd-440983f4c3a1',
  '954076d4-15f8-4d59-b523-71

In [257]:
#Remove empty lists
context_candidates_id = {k: v for k, v in context_candidates_id.items() if v}
context_candidates_id


#### Using LLM to assess context candidates

In [268]:
def c_eval_system_prompt():
    sys_prompt = """Din opgave er at evaluere hvorvidt et givent tekstuddrag indeholder svaret til et spørgsmål. Du skal alene vurdere om uddraget indeholder svaret, og ikke om svaret er korrekt.

    - Tildel scoren 1, hvis tekstuddraget indeholder svaret til spørgsmålet.

    - Tildel scoren 0, hvis tekstuddraget ikke kan bruges til at besvare spørgsmålet.
    """
    return sys_prompt

def c_eval_user_prompt(question: str, context: str) -> str:
    """Prepare the prompt for the API call."""
    
    qa_egnet_tmlp = """Din Opgave:
    
    Vurder om følgende spørgsmål kan besvares ud fra den givne kontekst i tekstuddraget:
    
    spørgsmål:
    {insert_question}
    
    tekstuddrag:
    {insert_context}
    
    Returner din vurdering i følgende JSON-format:

    {{
    "context_score": [indsæt enten 0 eller 1 her]
    }}
    """
    return qa_egnet_tmlp.format(insert_question=question, insert_context=context)


def context_question_assesment(context_candidates, question_context_object: QuestionContextManager) -> Dict[str, List[str]]:
    """
    Iterates over the context candidate texts and uses an LLM call to assess whether each context can answer the corresponding question
    
    Returns:
    A dictionary mapping each question ID to a list of context IDs, that according to the LLM can be used to answer the question
    """
    question_context_matches = {}
    system_prompt = c_eval_system_prompt()
    
    for q_id, c_id_list in tqdm(context_candidates.items()):
        question_text = question_context_object.questions[q_id].page_content
        for c_id in c_id_list:
            context_text = question_context_object.contexts[c_id].page_content
            user_prompt = c_eval_user_prompt(question=question_text, context=context_text)
            response = json_api_call(system_prompt, user_prompt)
            if response:
                # .get avoids a KeyError when the model omits the expected key
                if response.get('context_score') == 1:
                    question_context_matches.setdefault(q_id, []).append(c_id)
            else:
                logging.error(f'Failed to evaluate question {q_id} against context {c_id} due to an earlier error.')
    return question_context_matches

In [264]:
question_context_matches = context_question_assesment(context_candidates_id, qc_meta)

  0%|          | 0/6 [00:00<?, ?it/s]

100%|██████████| 6/6 [00:09<00:00,  1.63s/it]


In [265]:
question_context_matches

{'1b9a4c43-31c2-4605-8892-a2a886d82aa5': ['954076d4-15f8-4d59-b523-71c75a9801e2',
  'cbf49bab-01e7-4747-9b07-a294d524de23'],
 '2b630fbe-830d-493e-8ea1-10d214493034': ['954076d4-15f8-4d59-b523-71c75a9801e2',
  '3ee5a517-271b-4a28-83f9-33c52fdb6bb0'],
 '4283f5e1-29c5-42c5-bc58-bd7ce1381216': ['b8355203-fda9-4d44-8d5d-e99b13df41a8',
  '3ee5a517-271b-4a28-83f9-33c52fdb6bb0'],
 'be587ec5-f06f-4a77-b9c7-da0bcc507763': ['3ee5a517-271b-4a28-83f9-33c52fdb6bb0',
  'da55bea0-ae38-492e-bf60-a9f7e81269bf',
  '9640f0aa-b801-40b6-9ccd-440983f4c3a1']}

In [210]:
# Function to append the filtered question-context matches to the existing qc_meta.question_context_id_pairs
def update_question_context_pairs(q_c_to_append, question_context_object: QuestionContextManager):
    for q_id, c_id_list in q_c_to_append.items():
        if q_id in question_context_object.question_context_id_pairs:
            # Create a set from the existing IDs for quick lookup
            existing_ids_set = set(question_context_object.question_context_id_pairs[q_id])
            # Filter out duplicates while preserving order
            filtered_c_id_list = [c_id for c_id in c_id_list if c_id not in existing_ids_set]
            # Extend the existing list with the filtered, non-duplicate IDs
            question_context_object.question_context_id_pairs[q_id].extend(filtered_c_id_list)
        else:
            # Assign a copy if the q_id is not already present, so later extends do not mutate the caller's list
            question_context_object.question_context_id_pairs[q_id] = list(c_id_list)
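
To illustrate the order-preserving merge, here is a small sketch on toy data. `ToyManager` is a hypothetical stand-in that exposes only the `question_context_id_pairs` attribute, and `merge_pairs` is a slight variation on the helper above that also drops duplicates occurring inside the incoming list.

```python
from typing import Dict, List


class ToyManager:
    """Hypothetical stand-in exposing only the attribute the helper touches."""

    def __init__(self, pairs: Dict[str, List[str]]):
        self.question_context_id_pairs = pairs


def merge_pairs(new_pairs: Dict[str, List[str]], manager: ToyManager) -> None:
    """Append new context IDs per question, keeping order and skipping duplicates."""
    for q_id, c_ids in new_pairs.items():
        existing = manager.question_context_id_pairs.setdefault(q_id, [])
        seen = set(existing)
        for c_id in c_ids:
            if c_id not in seen:
                existing.append(c_id)
                seen.add(c_id)


toy = ToyManager({"q1": ["c1"]})
merge_pairs({"q1": ["c1", "c2", "c2"], "q2": ["c3"]}, toy)
print(toy.question_context_id_pairs)
```

The original "source" context stays first in each list, and newly accepted contexts are appended after it.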

In [None]:
# update_question_context_pairs is a module-level function, not a method of
# QuestionContextManager, so qc_meta is passed in as an argument
update_question_context_pairs(question_context_matches, qc_meta)

### Result

In [211]:
qc_meta.question_context_id_pairs

{'af879f4e-769a-48ed-a06d-b68612ae72b3': ['92a977ff-de34-4716-90d4-b0cacc3cee6d'],
 '8e89c5c8-56aa-447c-9566-82e927396208': ['4b76eac4-af83-41d2-a301-9bb68fbad516'],
 '69dd61a1-a550-44ff-b889-071e47f8e880': ['03f949b3-7afe-4b87-b8fa-24f94fdbd712',
  '1592317d-ce26-4fd2-bef2-6f27ddbd11be'],
 '1e423702-3a47-49e3-9091-3aca496ff105': ['1592317d-ce26-4fd2-bef2-6f27ddbd11be'],
 '80e804eb-fb01-4841-aa87-2bfeccf02ae4': ['769c0932-22ac-44ca-9f6b-6e483e922fc5',
  '1592317d-ce26-4fd2-bef2-6f27ddbd11be',
  '03f949b3-7afe-4b87-b8fa-24f94fdbd712'],
 '3f54b2c2-aa98-4230-9158-3d40c98041dd': ['0f37ace7-3661-492d-94b9-8cd75fd5ef4a'],
 '3d21924a-6893-4390-97a3-799a0489c853': ['a322ce23-a3f3-4f15-abf2-054f758fbf59',
  '1592317d-ce26-4fd2-bef2-6f27ddbd11be'],
 '6f7d27b7-61d2-400b-8b97-54fa200ce19c': ['cd7283eb-3cb3-4d5f-81bd-95ac70b891fe',
  '6fd3df39-57cf-4163-a23e-c33d63b3bd87',
  '1979dc66-4820-463f-8310-faace278cbfb'],
 '59b53755-29b9-4bd6-98ba-024ec284065b': ['1979dc66-4820-463f-8310-faace278cbfb'],
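
As a pointer towards the not-yet-implemented Step 5, a mapping shaped like the one above could be flattened into BEIR-style qrels rows (`query-id`, `corpus-id`, `score`, tab-separated). This is only a sketch: `to_qrels_rows` and `qrels_tsv` are hypothetical helper names, the IDs are toy values, and giving every labelled pair a relevance score of 1 is an assumption.

```python
import csv
import io
from typing import Dict, List, Tuple


def to_qrels_rows(pairs: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Flatten {question_id: [context_id, ...]} into (query-id, corpus-id, score) rows."""
    # Assumption: every labelled pair is treated as equally relevant (score 1)
    return [(q_id, c_id, 1) for q_id, c_ids in pairs.items() for c_id in c_ids]


def qrels_tsv(rows: List[Tuple[str, str, int]]) -> str:
    """Render the rows as tab-separated text with a qrels header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


toy_pairs = {"q1": ["c1", "c2"], "q2": ["c3"]}
qrels_rows = to_qrels_rows(toy_pairs)
qrels_text = qrels_tsv(qrels_rows)
print(qrels_text)
```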
 

In [212]:
qc_meta.questions

{'af879f4e-769a-48ed-a06d-b68612ae72b3': Document(page_content='Hvem skal regulere løbende erstatninger tilkendt før 1. januar 2024?', metadata={'title': 'Vejledning om regulering af satser fra 1. januar 2024 efter lov om arbejdsskadesikring, lov om sikring mod følger af arbejdsskade, lov om arbejdsskadeforsikring og lov om forsikring mod følger af ulykkestilfælde'}),
 '8e89c5c8-56aa-447c-9566-82e927396208': Document(page_content='Hvordan beregnes grundlønnen for løbende erstatninger for tab af erhvervsevne ifølge Arbejdstilsynets vejledning fra den 5. januar 2024?', metadata={'title': 'Vejledning om regulering af satser fra 1. januar 2024 efter lov om arbejdsskadesikring, lov om sikring mod følger af arbejdsskade, lov om arbejdsskadeforsikring og lov om forsikring mod følger af ulykkestilfælde'}),
 '69dd61a1-a550-44ff-b889-071e47f8e880': Document(page_content='Hvem er omfattet af pligten til selvbooking ifølge loven om en aktiv beskæftigelsesindsats fra 22. maj 2022?', metadata={'titl

In [213]:
qc_meta.contexts

{'92a977ff-de34-4716-90d4-b0cacc3cee6d': Document(page_content='De private arbejdsskadeforsikringsselskaber samt de arbejdsgivere, der er fritaget for at afgive risikoen efter loven, skal selv regulere løbende erstatninger, som er tilkendt før 1. januar 2024. Ved løbende erstatninger tilkendt i 2024 vil det fremgå af Arbejdsmarkedets Erhvervssikrings afgørelse, hvilke beløb, der skal udbetales i 2024.', metadata={'title': 'Vejledning om regulering af satser fra 1. januar 2024 efter lov om arbejdsskadesikring, lov om sikring mod følger af arbejdsskade, lov om arbejdsskadeforsikring og lov om forsikring mod følger af ulykkestilfælde'}),
 '4b76eac4-af83-41d2-a301-9bb68fbad516': Document(page_content='Arbejdstilsynet, den 5. januar 2024\nSine Frederiksen\n/ Helle Klostergaard Christensen\nBilag 1\nBilaget indeholder eksempler på beregninger af kapitalerstatninger, godtgørelsesbeløb og overgangsbeløb samt løbende erstatninger og godtgørelser, som tilskadekomne eller dennes efterladte har re