# <u>Foundations of Language Technology 2023/24</u>

## <u>Shared Task - SubTask 3 </u>

##### Please enter your group number as well as the name of each group member in the field below.

YOUR ANSWER HERE
Group Number: 5
Group Member:

Jobst Harzer
Nehath Nils Mia
Xiaoyan Xue

_**Regarding types, documentation, and output:**_

_We have aimed to provide clear descriptions of the tasks and the underlying methods. If anything is unclear, please reach out to us in the Shared Task  discussion forum on Moodle. To enhance clarity, we've included type hints for both function parameters and return values as well as a short description of the method and its components. You are only required to implement the functionality of methods and fill in your code in specified code cells (YOUR CODE HERE)_

_While implementing your code, you should use the provided method stubs and parameters. Additionally, make sure that your code runs smoothly without errors and executes within a reasonable amount of time. A recommended practice is to utilize "Kernel/Restart & Run All" before submission to verify its functionality._

_We encourage the use of comments where necessary to explain your code. Lastly, pay attention to how you output the results._

_**Please only modify the template in the specified markdown and code cells(e.g. YOUR CODE / ANSWER) and refrain from modifying other cells. Especially the blank cells are left blank on purpose since they are used to autograde your submission. If you modify these cells the automatic grading will fail for your submission and we might deduct points. The cells containing tests should remain untouched to ensure accurate evaluation. If you wish to conduct additional tests, utilize the provided code cell for your solutions (YOUR CODE HERE). Unfinished methods will contain the following line "raise NotImplementedError()", these are used to raise an error if the method is not implemented yet. Please replace this line of code with your actual implementation. You are allowed to write helper functions but please utilize the provided code cell for your solutions (YOUR CODE HERE). No additional imports should be made, you are only allowed to use modules mentioned in the code cells or built-in python functions, which do not need to be imported.**_

**Submission:**

Please upload your submission to Moodle before  <font color="red">Jan 21th, 23:59pm </font>!

Submission format: `Group_XX_Shared_Task_3.zip`( e.g. for Group 29, you should submit the file with the name Group_29_Shared_Task_3.zip). 

Your submission should contain your filled out Jupyter notebook (naming schema: `Shared_Task 3.ipynb`) and the devset json file `dev.json` which is necessary to run your code.

Each submission must be handed in only once per group.

## Task 1 - 70 Points

In [1]:
# run this cell to import all modules needed for Task 1
from typing import Tuple, List, Dict, DefaultDict, Set
import json
from collections import Counter
from collections import defaultdict
import random

##### __a)__(5 Points) 
You will be working with the devset which consists of 100 CCAs and 500 QA-Pairs with annotations made by you in the previous Subtasks of the Shared Tasks. Look into the dataset to get a better feel for the data and its structure. For the analysis of the devset and the creation of datasets for training classifiers later, we need to read in the CCAs from the devset and save the relevant information.

For this task, go through all QA-Pairs in the file and save them in a list of tuples containing the following information: `question type` refers to the annotated type of the question of the current QA-Pair, `question` refers to the question text of the current QA-Pair,  `gold answer` refers to the given gold answer of the current QA-Pair, `topics` include the topics from topic1 and topic2 of the current QA Pair, `llm model name` refers to the name of the Large Language model that generated the answer of the current QA Pair, `llm answer` refers to the whole answer presented by the specific Large Language model of the current QA-Pair, `llm answer units` refers to the answer units which make up the answer given by the Large Language model of the current QA-Pair, `labels` are the specific labels for each answer unit of the current QA-Pair.
Each tuple represents a single QA-Pair and should contain all the above-mentioned information in the order mentioned. All Strings should be processed and read in like in the tutorial for this Shared Task. __Exclude__ QA-Pairs that were annotated as invalid answers (QA-Pairs were considered invalid when the generated answers content were not relevant to the question).

In [2]:
def extract_data(path: str) -> List[tuple]:
    """
    Extracts the relevant information from the devset JSON file containing cca_llm_answers.

    Args:
        path (str): The path to the JSON file.

    Returns:
        list: A list of tuples, each containing information about a QA pair.
              Each tuple includes the following elements in this particular order:
              - question type: str
              - question: str
              - gold answer: str
              - topics: List[str]
              - llm model name: str
              - llm answer: str
              - llm answer units: List[str]
              - labels: List[str]
    """
     
    # YOUR CODE HERE
    with open(path,"r", encoding="utf-8",) as file:
        cca_llm_answers = json.load(file)
        annotation_entries = []

    for qa_pair in cca_llm_answers:
        #id = cca_llm_answer["id"]
        topics = []
        units= []

        # Exclude invalid QAs
        valid = qa_pair["annotations"][0]["result"][0]["value"]["choices"][0]
        if(valid == "Invalid"):
            continue;

        # Split units of the llm_answer
        for m in qa_pair["annotations"][0]["result"]:
            if 'text' in m["value"]:
                units.append(m["value"]["text"])
                # print(m["value"]["text"])

        # print(qa_pair["annotations"][0]["result"][0]["value"]["choices"][0])
            
        llm_model = qa_pair["data"]["llm_model_name"][:-7]
        topic_1 = qa_pair["data"]["topic1"][1:-1].replace("'", "").split(", ")
        topic_2 = qa_pair["data"]["topic2"][1:-1].replace("'", "").split(", ") 
        question = qa_pair["data"]["question"].replace('\n', '')
        gold_answer = qa_pair["data"]["gold_answer"].replace('\n', '')
        llm_answer = qa_pair["data"]["llm_answer"].replace('\n', '')
        
        # Split units of the llm_answer
        # units =   qa_pair["data"]["llm_answer"].replace('\n', '.').split(".")
        # sents = []
        # for sent in units:
        #     if len(sent) < 15:
        #         continue
        #     sents.append(sent)
            

        # Add topics into a list if not empty
        for i in topic_1:
            if(i != ''):
                topics.append(i)
        for m in topic_2:
            if(m != ''):
                topics.append(m)
        
        # we initalize the values as an empty string or an empty list
        question_type, labels = "", []

        # we iterate through the entries of the values to extract validity, question_type and the labels 
        for answer_unit in qa_pair["annotations"][0]["result"]:
                if answer_unit["type"] == "labels":
                    # Extracting labels and handling the case where labels are not present
                    labels.append(answer_unit["value"].get("labels", [""])[0])
                elif answer_unit["type"] == "choices" and answer_unit["origin"] == "manual":
                        question_type = answer_unit["value"]["choices"][0]

        # print(topics)
        # print(llm_model)
        # print(llm_answer)
        # print(" ")
        # if(len(labels)==14):
        # print("length of units:")
        # print(len(units))
        # print("length of labels:")
        # print(len(labels))
        # print(" ")

        # if(len(units)!=len(labels)):
        #     print("length of units:")
        #     print(len(units))
        #     print("length of labels:")
        #     print(len(labels))
        
        #     for i in sents:
        #         print(i)
        #         print(" ")
        #     # print(sents)
        #     print(labels)
        #     print(" ")

        # appending the tuple of information for this qa_pair
        annotation_entries.append(
            (question_type, question, gold_answer, topics, llm_model, llm_answer, units, labels)
        )


        

    qa_pairs = annotation_entries
        
    return qa_pairs

In [3]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
data = extract_data("dev.json")
assert len(data[0]) == 8
assert (type(data[0])) == tuple
assert type(data) == list
assert 'Agree with the gold answer' in data[0][7]

In [4]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __b)__ (15 Points) 
In this task we want to compare the relationship between the number of LLM answer units and the number of harmful answers with the same amount of answer units (Pay attention to what is considered a harmful answer). Consider for this task all extracted QA-Pairs from the task 1a). Create for this task, a dictionary that maps the number of answer units of a generated answer by a LLM to the number of harmful and non-harmful answers with the same amount of answer units.

In [5]:
def map_answer_units_to_harmfulness(qa_pairs: List[Tuple[str, str, str, List[str], str, str, List[str], List[str]]]) -> dict:
    """
    Maps the number of answer units in LLM Answers to the count of harmful and non-harmful answers with the same number of answer units.

    Args:
        qa_pairs (List[tuple]): List of tuples containing information about QA pairs.

    Returns:
        dict[int, dict[str, int]]: A dictionary where keys represent the number of answer units per LLM answer, 
                                   and values are dictionaries with counts of harmful and non-harmful answers.
                                   The nested dictionaries have keys 'harmful_answers' and 'non_harmful_answers'.
    """
    answer_units_dict = {}
    # YOUR CODE HERE
    def create_llm_models_dict(data):
        '''
        Creates a dictionary mapping llm_model names to a list of associated labels.
    
        Args:
            data(list of tuples): the extracted data from the function `extract_data`.
    
        Returns:
            dict[str, list[str]]: A dictionary where keys are llm_model names and values are lists of associated labels.
        '''
        # defaultdict handles the case where the key does not exist in the dictionary yet by creating a new key,value pair
        llm_model_labels = defaultdict(list)
        
        for qa_pair in data:
            units, label = qa_pair[6], qa_pair[7]
            
            # print(label)

            harm = 0
            non_harm = 0

            for i in label:

                if i == "Contradiction" or i == "Exaggeration":
                    harm = harm + 1
                else:
                    non_harm  = non_harm + 1
                    
            # Extend the list of labels associated with the current llm_model
            if harm > non_harm:
                llm_model_labels[len(units)].extend(["harm"])
            else:
                llm_model_labels[len(units)].extend(["non_harm"])

            
            
                
            #     if i == "Contradiction" or i == "Exaggeration":
            #         harm = harm + 1
            #     else:
            #         non_harm  = non_harm + 1
                    
            # # Extend the list of labels associated with the current llm_model
            # if harm > non_harm:
            #     llm_model_labels[len(units)].extend("harmful")
            # else:
            #     llm_model_labels[len(units)].extend("non_harmful")
                
                
            
            # Extend the list of labels associated with the current llm_model
            # llm_model_labels[len(units)].extend(label)
    
        # Convert the defaultdict to a regular dictionary before returning
        return dict(llm_model_labels)

    def count_label(category_labels_dict):
        '''
        Count the occurrences of each label for each category in a given label dictionary.
    
        Parameters:
        - label_dict (dict[str, List[str]]): Dictionary mapping categories to lists of labels.
    
        Returns:
        - dict[str, dict[str, int]]: Dictionary containing label names and their associated counts for each category.
        '''
        # Creating a dict by counting occurrences of labels for each category using Counter 
        return {category: dict(Counter(labels)) for category, labels in category_labels_dict.items()}

    def count_harmful_labels(label_occurences):
        '''
        Calculate the count of harmful and non-harmful labels for each category.
    
        Args:
            label_occurences (dict[str, dict[str, int]]): A dictionary mapping categories to label names and their associated counts.
    
        Returns:
            (dict[str, dict[str, int]]): A dictionary containing counts of harmful and non-harmful labels for each category.
        '''
        
        harmful_label_count = {}
        for category, labels_dict in label_occurences.items():
            
            # Calculate the occurences of harmful labels ('Exaggeration' and 'Contradiction') for the current category.
            harmful_count = labels_dict.get('harm', 0)
            # Calculate the occurences of non-harmful labels for the current category by excluding Exaggeration, Contradiction.
            non_harmful_count = sum(value for key, value in labels_dict.items() if key not in {'harm'})
            #  Create a dictionary entry for the current category, containing counts of harmful and non-harmful labels.
            harmful_label_count[category] = {'harmful_answers': harmful_count, 'non_harmful_answers': non_harmful_count}
       
        return harmful_label_count

    llm_models_dict = create_llm_models_dict(data)

    # print(llm_models_dict)
    
    label_occurences = count_label(llm_models_dict)

    # print(label_occurences)
    # print(" ")
    
    harmful_label_occurences = count_harmful_labels(label_occurences)

    # print(harmful_label_occurences)

    answer_units_dict = harmful_label_occurences

    print(answer_units_dict)
    print(" ")

    print("answer_units_dict[14]")
    print(answer_units_dict[14])
    print("sum(answer_units_dict[14].values())")
    print(sum(answer_units_dict[14].values()))

    # print(answer_units_dict[1])

    return answer_units_dict

In [6]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
answer_units_dict = map_answer_units_to_harmfulness(data)
assert isinstance(answer_units_dict, dict)
assert len(answer_units_dict[1].values()) == 2
assert 5 in answer_units_dict.keys()
assert sum(answer_units_dict[14].values()) == 10
assert answer_units_dict[1] == {'harmful_answers': 1, 'non_harmful_answers': 3}

{1: {'harmful_answers': 1, 'non_harmful_answers': 3}, 5: {'harmful_answers': 5, 'non_harmful_answers': 69}, 3: {'harmful_answers': 14, 'non_harmful_answers': 45}, 9: {'harmful_answers': 2, 'non_harmful_answers': 26}, 13: {'harmful_answers': 0, 'non_harmful_answers': 11}, 8: {'harmful_answers': 1, 'non_harmful_answers': 32}, 11: {'harmful_answers': 0, 'non_harmful_answers': 20}, 2: {'harmful_answers': 8, 'non_harmful_answers': 25}, 7: {'harmful_answers': 4, 'non_harmful_answers': 28}, 4: {'harmful_answers': 4, 'non_harmful_answers': 87}, 14: {'harmful_answers': 0, 'non_harmful_answers': 10}, 12: {'harmful_answers': 1, 'non_harmful_answers': 12}, 10: {'harmful_answers': 1, 'non_harmful_answers': 22}, 6: {'harmful_answers': 2, 'non_harmful_answers': 40}, 15: {'harmful_answers': 0, 'non_harmful_answers': 7}, 17: {'harmful_answers': 0, 'non_harmful_answers': 2}, 18: {'harmful_answers': 0, 'non_harmful_answers': 4}, 22: {'harmful_answers': 0, 'non_harmful_answers': 1}, 16: {'harmful_answers'

In [7]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __c)__ (15 Points) 
Save the 10 answer units with the highest relative amount of harmful answers to the total number of answers with this amount of answer units. Only consider answer units with at least 10 total answers(including harmful and non-harmful answers).

In [8]:
def find_top_answer_units(answer_units_dict: Dict[int, Dict[str, int]]) -> List[Tuple[int, float]]:
    """
    Find the top 10 answer units sorted by the fraction of harmful answers among the total answers for each number of answer units.

    Args:
        answer_units_dict (Dict[int, Dict[str, int]]): Dictionary mapping the number of answer units to counts of harmful and non-harmful answers.

    Returns:
        List[Tuple[int, float]]: A list of tuples where each tuple contains the number of answer units and the fraction of harmful answers among the total answers.
    """
    top_answer_units = []
    # YOUR CODE HERE

    # A function that returns the length of the value:
    def min_func(e):
      return e[1]
    
    min_fraction = 1.0

    for i in answer_units_dict:
        # Check if the category exists in the dictionary
        labels, counts = zip(*answer_units_dict[i].items())
        total_amount_labels = sum(counts)
        total_dist = 0
        if total_amount_labels < 10:
            continue
        # print("\nThe distributions for the number of units "+ str(i) + " are as following: \n") 

        units = (i,counts[0]/total_amount_labels)

        if len(top_answer_units) < 10:
            top_answer_units.append(units)
            min_fraction = min(min_fraction, counts[0]/total_amount_labels) 
        elif min_fraction != max(min_fraction, counts[0]/total_amount_labels):
            top_answer_units.append(units)

            top_answer_units.sort(key = min_func)
            # print("min")
            # print(top_answer_units[0])
            top_answer_units.remove(top_answer_units[0])
                    
            min_fraction = top_answer_units[0][1]
            
            
        # print(units)
        # print(min_fraction)

        # Show distributions if specified
        # for i in range(len(labels)):
        #     print("The distribution for "+ labels[i] + " is " +str(counts[i]/total_amount_labels))

    print(top_answer_units)
    
    return top_answer_units

In [9]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
top_answer_units = find_top_answer_units(answer_units_dict)
assert isinstance(top_answer_units, list)
assert len(top_answer_units) == 10
assert all(isinstance(element, tuple) for element in top_answer_units)
assert all(isinstance(element[0], int) for element in top_answer_units)
assert all(isinstance(element[1], float) for element in top_answer_units)

[(8, 0.030303030303030304), (10, 0.043478260869565216), (4, 0.04395604395604396), (6, 0.047619047619047616), (5, 0.06756756756756757), (9, 0.07142857142857142), (12, 0.07692307692307693), (7, 0.125), (3, 0.23728813559322035), (2, 0.24242424242424243)]


In [10]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __d)__ (15 Points)
Now we want to examine how often each label(6 label categories) occurs for every combination of large language model, question type and a single topic among all QA-Pairs. In this task, you need to create a dictionary that maps each combination of LLM model, question type, and a single topic to a dictionary that has entries for each label category and its occurrences. Only include combinations in which all llm models, question type and the single topic are all non-empty and include characters.  **Hint:** You may use defaultdict and Counter from Collections for this task.

In [78]:
def create_combination_label_mapping(qa_pairs: List[Tuple[str, str, str, List[str], str, str, List[str], List[str]]]) -> dict:
    """
    Generates a dictionary mapping every combination of question type, topic, and llm model name
    to a dictionary of labels and their total occurrences per category.

    Args:
        qa_pairs (List[tuple]): List of tuples containing information about QA pairs.

    Returns:
        dict[[tuple], dict[str: int]]: A dictionary mapping every combination of question type, topic, and llm model name
              to a dictionary of labels and their total occurrences.
    """
    combination_label_mapping = {}
    # YOUR CODE HERE
    c = Counter({'Exaggeration': 0,'Contradiction': 0,'Cannot assess': 0, 'Agree with the gold answer': 0, 'General comment': 0,  'Understatement': 0})
          
    
    for i in qa_pairs:
        for m in i[3]:
            combination = (i[0],m,i[4])
            # print(combination)
        # print(type(combination_label_mapping))
            counter = Counter(i[7])
        # print(counter)

        # if combination == ('2. Open ended - Comparison of different specific interventions', 'neurology', 'ChatGPT_prompt0'):
        #     print(Counter(i[7]))
        # if(i[0] == '2. Open ended - Comparison of different specific interventions' and i[4] == 'ChatGPT_prompt0' and 'neurology' in i[3]):
        #     print("Question Type:")
        #     print(i[0])
        #     print("Question:")
        #     print(i[1])
        #     print("Tpoics:")
        #     print(i[3])
        #     print("Model Name:")
        #     print(i[4])
        #     print("Labels:")
        #     print(i[7])
        #     print("")

        
            if combination not in combination_label_mapping:
                combination_label_mapping[combination] = Counter(i[7])
            else:
            # print("Before")
            # print(combination)
            # print(combination_label_mapping[combination])
            # print("")
            # print("Add")
            # print(Counter(i[7]))
            # print("")
                combination_label_mapping[combination] = combination_label_mapping[combination] + Counter(i[7])
            # print("After")
            # print(combination)
            # print(combination_label_mapping[combination])
            # print("")
            combination_label_mapping[combination].subtract(c)

        
    # print(combination_label_mapping)
    # for value in combination_label_mapping.values():
    #     if len(value) != 6:
    #         print(value)
    print("Result:")
    print(label_combinations[('2. Open ended - Comparison of different specific interventions', 'neurology', 'ChatGPT_prompt0')])
    print("Expected:")
    print({'Exaggeration': 0, 'Contradiction': 0, 'Cannot assess': 3, 'Agree with the gold answer': 5, 'General comment': 1, 'Understatement': 0})
    return combination_label_mapping

In [79]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
label_combinations = create_combination_label_mapping(data)
assert isinstance(label_combinations, dict)
assert all(isinstance(value, dict)  for value in label_combinations.values())
assert all(len(value) == 6 for value in label_combinations.values()) 
assert label_combinations[('2. Open ended - Comparison of different specific interventions', 'neurology', 'ChatGPT_prompt0')] == {'Exaggeration': 0, 'Contradiction': 0, 'Cannot assess': 3, 'Agree with the gold answer': 5, 'General comment': 1, 'Understatement': 0}

Result:
Counter({'Agree with the gold answer': 7, 'Cannot assess': 4, 'General comment': 3, 'Exaggeration': 2, 'Contradiction': 1, 'Understatement': 0})
Expected:
{'Exaggeration': 0, 'Contradiction': 0, 'Cannot assess': 3, 'Agree with the gold answer': 5, 'General comment': 1, 'Understatement': 0}


AssertionError: 

In [80]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __e)__ (20 Points) 
In this task, we want to find the combinations with the highest occurrence for each label category(6). Create a dictionary that maps each label category to a list of combinations like in 1d) and its occurences for this label category but only include entries that have the highest occurence for a label category . If multiple combinations appear the same amount of times for this label category, all of them should be represented in the result.

In [None]:
def find_highest_combinations(combination_label_mapping: dict[Tuple[str, str, str], dict[str, int]]) -> dict:
    """
    Finds combinations with the highest occurrence for each label.

    Args:
        combination_label_mapping (Dict): A dictionary mapping combinations to labels and their occurrences.

    Returns:
        dict[str, List[Tuple[Tuple[str, str, str], int]]]: A dictionary mapping each label to a list of combinations with the highest occurrence of this label.
                                                           Each entry in the list is a tuple containing the combination and its occurrence.
    """
    result_mapping = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return result_mapping

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
highest_combinations = find_highest_combinations(label_combinations)
assert isinstance(highest_combinations, dict)
assert all(isinstance(value, list)  for value in highest_combinations.values())
assert all(len(value[0]) == 2 for value in highest_combinations.values()) 
assert all(len(value[0][0]) == 3 for value in highest_combinations.values()) 
assert all(isinstance(value[0][1], int) for value in highest_combinations.values())
assert len(highest_combinations.items()) == 6
assert highest_combinations['General comment'] == [(('4. Open ended - General effects of a specific intervention', 'dementia', 'bingchat_prompt0'), 10)]

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

## Task 2 - 30 Points

In task 2 we will build a Decision Tree classifier and a Neural Network classifier and evaluate both afterwards. The first classifying task will ask the classifier to predict whether a given LLM answer is harmful or not. In the second classifying task, the classifier has to predict the label (6 categories) of a specific LLM answer unit.

In [None]:
# run this cell to import all modules needed for Task 2
from sklearn.feature_extraction import DictVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
## download this modules in case you have not downloaded them yet
#nltk.download('punkt')
#nltk.download('stopwords')

##### __a)__ (15 Points)
To achieve the first classification task mentioned above, we need to create a training and test split for the classifier. The input for the classifier should contain for each answer the question, the gold answer, the generated answer by the LLM, and additional features chosen by you that will improve the model's performance. You may use the extracted QA-Pairs from 1a). The dataset should be shuffled with the random seed 42. After shuffling the data, the dataset should be split on question level, meaning 80% of the QA-Pairs should be in the training split and 20% of the QA-Pairs should be in the test split. Prepare the data in such a way that for each data point all the following features: `question`,  `gold answer`, `large language model name`, `llm answer`, `num_answer_units`(the number of answer units generated by the LLM for each answer), `div_of_tokens`(the ratio of unique tokens of the LLM answer compared to all tokens of the generated answer by the LLM), `avg_len_words`(the average number of characters of all words for the given LLM answer) are represented as well as one of the two label categories(`harmful` and `non_harmful`). All of the highlighted features should be included and the features `num_answer_units`,`div_of_tokens` and `avg_len_words` should be only used for this task. Use **word_tokenize()** from nltk to tokenize the text, lowercase the tokens and remove punctuation and stopwords using `stopwords` from `nltk.corpus`. This task will be tested by testing functions and by evaluating the classifier's performance on the created datasets.

In [None]:
def create_train_test_split_answer(qa_pairs: List[Tuple[str, str, str, List[str], str, str, List[str], List[str]]]) -> Tuple[List[Tuple[dict, str]], List[Tuple[dict, str]]]:
    '''
    Create a feature set and perform a train-test split. The train split is used to train the classifiers on. The test split is used to evaluate the classifiers on.

    Args:
        qa_pairs (List[tuple]): List of tuples containing information about QA pairs.

    Returns:
        Tuple[List[Tuple[dict, str]], List[Tuple[dict, str]]]: 
        A tuple containing two lists:
        - The first list is the train split, where each entry is a tuple with a feature dictionary and its label ("harmful" or "non_harmful").
        - The second list is the test split, with the same format.
        - the feature dictionary should look like this: {"question": _, "gold_answer": _, "llm_answer": _, "llm_model_name": _,"num_answer_units": _,"div_of_tokens": _, "avg_len_words": _}
        Example: The features: num_answer_units, div_of_tokens and avg_len_words would be: {num_answer_units': 4, 'div_of_tokens': 0.7538461538461538, 'avg_len_words': 5.859813084112149} for the following LLM answer: 
                 "Macrolides have been studied as a potential treatment for chronic asthma, and their effectiveness compared to placebo varies depending on the specific study and patient population.
                 Some studies suggest that macrolides may help reduce asthma exacerbations and improve lung function in certain individuals with chronic asthma, while others show no significant difference compared to a placebo. 
                 The effectiveness of macrolides in asthma management may depend on factors such as the patient's asthma phenotype, the specific macrolide used, and the duration of treatment. 
                 Therefore, it's essential to consult with a healthcare professional to determine if macrolides are a suitable option for managing chronic asthma in a particular case." 
    '''
    train_split, test_split = [],[]
    # YOUR CODE HERE
    raise NotImplementedError()
    return train_split, test_split

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
from numpy import isclose
train_split_answer, test_split_answer = create_train_test_split_answer(data)
assert isinstance(train_split_answer, list)
assert isinstance(test_split_answer, list)
assert all(isinstance(element[0], dict) for element in train_split_answer) 
assert all(isinstance(element[1], str) for element in train_split_answer) 
assert all(isinstance(element[0], dict) for element in test_split_answer) 
assert all(isinstance(element[1], str) for element in test_split_answer) 
assert all("num_answer_units" in element[0].keys() for element in train_split_answer)
assert all("num_answer_units" in element[0].keys() for element in test_split_answer)
assert all("div_of_tokens" in element[0].keys() for element in train_split_answer)
assert all("div_of_tokens" in element[0].keys() for element in test_split_answer)
assert all("avg_len_words" in element[0].keys() for element in train_split_answer)
assert all("avg_len_words" in element[0].keys() for element in test_split_answer)

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __b)__ (15 Points)
For the second classification task , we also need to create a training and test split for the classifier. The input for the classifier should contain for each answer the question, the gold answer, the generated answer units by the LLM, and additional features chosen by you that will improve the model's performance. You may use the extracted QA-Pairs from 1a). The dataset should be shuffled with the random seed 42. After shuffling the data, the dataset should be split on question level, meaning 80% of the QA-Pairs should be in the training split and 20% of the QA-Pairs should be in the test split. Prepare the data in such a way that for each data point all the following features: `question`,  `gold answer`, `large language model name`, `llm answer unit`, `trigrams`(create a list of all trigrams for the tokenized an preprocessed answer unit and convert it into one string for adding to the feature dictionary, you can use `ngrams` for creating the list), `word_count`(the number of words in the llm answer unit), `token_overlap`(the number of unique tokens that overlap between the gold answer and the LLM answer unit) are represented as well as one of the six label categories. All of the highlighted features should be included and the features `trigrams`,`word_count` and `token_overlap` should be only used for this task. Preprocess the data for `trigrams`,`word_count` and `token_overlap` in the same way as in 2a) before extracting the features and **exclude** empty answer units. This task will be tested by testing functions and evaluating the classifiers perfomance on the created datasets.

In [None]:
def create_train_test_split_answer_units(qa_pairs: List[Tuple[str, str, str, List[str], str, str, List[str], List[str]]]) -> Tuple[List[Tuple[dict, str]], List[Tuple[dict, str]]]:
    '''
    Create a feature set and perform a train-test split. The train split is used to train the classifiers on. The test split is used to evaluate the classifiers on.

    Args:
        qa_pairs (List[tuple]): List of tuples containing information about QA pairs.

    Returns:
        Tuple[List[Tuple[dict, str]], List[Tuple[dict, str]]]: 
        A tuple containing two lists:
        - The first list is the train split, where each entry is a tuple with a feature dictionary and its label (6 categories).
        - The second list is the test split, with the same format.
        - the feature dictionary should look like this: {"question": _, "gold_answer": _, "llm_model_name": _, "llm_answer_unit": _, "trigrams": _, "word_count": _, "token_overlap": _}
        Example: The features: trigrams, word_count and token_overlap would be: {'trigrams': 'Intrathecal nusinersen shown nusinersen shown effective shown effective treating effective treating infants treating infants spinal infants spinal muscular spinal muscular atrophy muscular atrophy SMA atrophy SMA type SMA type compared type compared sham compared sham procedure',
                 'word_count': 14, 'token_overlap': 7} for the following LLM answer unit and gold answer: 
                 'llm_answer_unit': 'Intrathecal nusinersen has been shown to be effective in treating infants with spinal muscular atrophy (SMA) type I, compared to sham procedure.'
                 'gold_answer': 'Reviewers conducted a search in October 2018 and found only a single small RCT assessing nusinersen in infants with SMA type I. Nusinersen may reduce the combined outcome of death or need for full‐time ventilation, and infants may experience fewer severe adverse events compared with a sham procedure; effects on other outcomes were unclear.
                                 No firm conclusions can be drawn based on this small study. In addition to the paucity of evidence, nusinersen is incredibly expensive (125,000 USD per dose, with at least four doses in the first year of treatment), is dosed via lumbar/spinal tap, and may require sedation and special treatment centers for administration. Click here for further information.'
    '''
    train_split, test_split = [],[]
    # YOUR CODE HERE
    raise NotImplementedError()
    return train_split, test_split

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Public Tests
train_split_answer_units, test_split_answer_units = create_train_test_split_answer_units(data)
print(train_split_answer_units[0])
assert isinstance(train_split_answer_units, list)
assert isinstance(test_split_answer_units, list)
assert isinstance(train_split_answer_units[0][0], dict)
assert isinstance(train_split_answer_units[0][1], str)
assert isinstance(test_split_answer_units[0][0], dict)
assert isinstance(test_split_answer_units[0][1], str)
assert all("trigrams" in element[0].keys() for element in train_split_answer_units)
assert all("trigrams" in element[0].keys() for element in test_split_answer_units)
assert all("word_count" in element[0].keys() for element in train_split_answer_units)
assert all("word_count" in element[0].keys() for element in test_split_answer_units)
assert all("token_overlap" in element[0].keys() for element in train_split_answer_units)
assert all("token_overlap" in element[0].keys() for element in test_split_answer_units)

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __c)__ (0 Points)
This function is used for training the classifier on the training split created in 2a) and 2b). Feel free to use this function to test the training split created by you. You must use the same vectorizer object for training and testing the neural network.


In [None]:
def create_cls(train_split: List[Tuple[dict, str]], vectorizer: DictVectorizer) -> Tuple[nltk.DecisionTreeClassifier, MLPClassifier]:
    """
    Train a Decision Tree Classifier (DTC) and a Neural Network Classifier using the provided data.

    Parameters:
    - train_split (List[Tuple[dict, str]]): A list of tuples containing feature dictionaries and their corresponding labels.
    - vectorizer (DictVectorizer): An instance of DictVectorizer for converting feature dictionaries into a vector representation.

    Returns:
    Tuple[nltk.DecisionTreeClassifier, MLPClassifier]: A tuple containing the trained Decision Tree Classifier (DTC) and Neural Network Classifier.
    """

    # Separate the train set into features and labels
    feature_train, labels_train = zip(*train_split)
    
    # Train the Decision Tree Classifier
    dtc = nltk.DecisionTreeClassifier.train(train_split)

    # Embed the names into a vector representation
    train_vectorized = vectorizer.fit_transform(feature_train)
    
    # Create the Neural Network Model and train it on the data
    neural_network_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
    neural_network_classifier.fit(train_vectorized, labels_train)

    return dtc, neural_network_classifier

In [None]:
# Use this cell to train the classifier on the training split you created.
# YOUR CODE HERE

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests

##### __d)__ (0 Points)
This function returns the precision, recall and f1 for a given classifier and label category. Use this function to test the test split created by you.
You must use the same vectorizer object for training and testing the neural network.

In [None]:
def evaluate_classifier(classifier, vectorizer:DictVectorizer, test_split: List[Tuple[dict, str]], is_nn: bool, label_category: str) -> Tuple[float, float, float]:
    '''
    Evaluate a classifier using precision, recall, and F1 score for a specific label category.

    Args:
        classifier (object): The trained classifier model.
        vectorizer (DictVectorizer): The vectorizer used to transform features.
        test_split (List[Tuple[Dict[str, str], str]]): The test set is a list, where each entry is a tuple with a feature dictionary and its label.
        is_nn (bool): Indicates if the given classifier is a neural network (MLPClassifier), otherwise, it is a nltk.DecisionTreeClassifier.
        label_category (str): The specific label category for which the evaluation is performed.

    Returns:
        Tuple[float, float, float]: A tuple containing precision, recall, and F1 score.
    '''
    precision, recall, f_score = 0, 0, 0
    ### Begin Solution
    feature_test, labels_test = zip(*test_split)
    predictions = []
    if is_nn:
        # vetorize test features
        test_vectorized = vectorizer.transform(feature_test)
        predictions =  classifier.predict(test_vectorized)
    else: 
        predictions = [classifier.classify(features) for features in feature_test]
    precision = precision_score(labels_test, predictions, labels=[label_category], average='weighted',zero_division=0)
    recall= recall_score(labels_test, predictions, labels=[label_category], average='weighted',zero_division=1)
    f_score= f1_score(labels_test, predictions, labels=[label_category], average='weighted', zero_division=0)
    ### End Solution
    return precision, recall, f_score

In [None]:
# Use this cell to evaluate the classifier on the test split you created. 
# YOUR CODE HERE

In [None]:
# DO NOT MODIFY THIS TEST CELL
# Private Tests