## cTAKES Analysis File

This file analyzes the named-entity recognition (NER) output that cTAKES produces for each model response.

### Overview:
Our study analyzes how performance changes between Direct and chain-of-thought (CoT) prompting. We observe that CoT answering underperforms in many scenarios. In this section, we examine the CoT output of each model-task pair using cTAKES, a tool that automatically extracts key clinical information from unstructured text.


#### File Breakdown:
There are around 87 tasks, each evaluated on 52 models, yielding a total of around 4000 files.

#### Analysis Plan:
As a first step, we should find a way to efficiently process all of the .xmi files. This means that for each task-model file, we extract all the relevant instances of clinical concepts found by cTAKES and store them in a simple data structure (or possibly even write them to a CSV).
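
The batch step described above could be sketched as follows. This is only a minimal sketch: it assumes the `root/task/model/*.xmi` layout suggested by the paths used later in this file, and `iter_xmi_files`, `write_concepts_csv`, and the `extract` callable are hypothetical names introduced here for illustration.

```python
import csv
from pathlib import Path


def iter_xmi_files(root_dir):
    """Yield (task, model, path) for every .xmi file stored as root/task/model/file.xmi."""
    root = Path(root_dir)
    for path in sorted(root.glob("*/*/*.xmi")):
        model_dir = path.parent
        yield model_dir.parent.name, model_dir.name, path


def write_concepts_csv(root_dir, out_csv, extract):
    """Write one CSV row per extracted concept and return the number of rows.

    `extract` is a hypothetical callable that takes the path of one .xmi file
    and returns a list of dicts (for example with cui, tui, preferredText keys).
    """
    fieldnames = ["task", "model", "file", "cui", "tui", "preferredText"]
    n_rows = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops any keys that are not listed in fieldnames
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for task, model, path in iter_xmi_files(root_dir):
            for concept in extract(path):
                writer.writerow({"task": task, "model": model, "file": path.name, **concept})
                n_rows += 1
    return n_rows
```

Writing rows as they are produced keeps memory use flat across the roughly 4000 files, at the cost of fixing the column set up front.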

From there, we have a collection of all clinical concept mentions for each CoT output of each task. We can use this collection for deeper analysis (the exact approach is still to be decided).
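
One possible starting point for that analysis, offered only as a sketch, is a per-output summary: how many concept records were extracted, how many distinct CUIs they cover, and how they are distributed across TUIs. The function name `summarize_concepts` is hypothetical, and the sample records reuse values from the extraction output shown later in this file.

```python
from collections import Counter


def summarize_concepts(concepts):
    """Summarize a list of concept records (dicts with at least 'cui' and 'tui')."""
    unique_cuis = {c["cui"] for c in concepts}
    tui_counts = Counter(c["tui"] for c in concepts)
    return {
        "n_records": len(concepts),
        "n_unique_cuis": len(unique_cuis),
        "tui_counts": dict(tui_counts),
    }


# Sample records in the same shape as the extraction output
sample = [
    {"cui": "C0031381", "tui": "T109", "preferredText": "Phencyclidine"},
    {"cui": "C0031381", "tui": "T121", "preferredText": "Phencyclidine"},
    {"cui": "C0043250", "tui": "T037", "preferredText": "Injury wounds"},
]
summary = summarize_concepts(sample)
print(summary)
```

Counting distinct CUIs separately matters because one mention can yield several records, for instance when the same CUI carries more than one TUI.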

___

### Load Dependencies

In [None]:
import os
import json
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET


path_to_ctakes = "/Users/kevinxie/Desktop/LLM CoT/cTAKES"
all_tasks = os.listdir(path_to_ctakes)



# Directory containing your .xmi files
directory = "/Users/kevinxie/Desktop/LLM CoT/cTAKES/1-2.ADE-ADE relation/DeepSeek-R1"

____

### Create Functions

In [79]:
def get_ontology_concepts(root):
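    """
    Collect every annotation element that references ontology concepts.

    Args:
        root (xml.etree.ElementTree.Element): Root of a parsed .xmi file.
    Returns:
        dict: Maps each annotation's xmi:id to its ontologyConceptArr,
            begin, end, polarity, subject, and historyOf attributes.
    """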
    # All ontology concepts in the file
    elements_with_ontology_concept_arr = []
    for elem in root.iter():
        if 'ontologyConceptArr' in elem.attrib:
            elements_with_ontology_concept_arr.append(elem)

    # Store in a dictionary
    elements_with_ontology_concept_dict = {}

    # Attributes: {'{http://www.omg.org/XMI}id': '1337', 'sofa': '1', 'begin': '318', 'end': '321', 'id': '0', 'ontologyConceptArr': '1313 1323', 'typeID': '1', 'discoveryTechnique': '1', 'confidence': '0.0', 'polarity': '1', 'uncertainty': '0', 'conditional': 'false', 'generic': 'false', 'subject': 'patient', 'historyOf': '0'}

    # Create a dictionary uniquely keyed by the xmi:id of each concept
    for concept in elements_with_ontology_concept_arr:
        # Get the ontologyConceptArr attribute
        ontology_concept_arr = concept.attrib['ontologyConceptArr']

        # Get the xmi:id attribute
        xmi_id = concept.attrib.get("{http://www.omg.org/XMI}id")

        # Store in a dictionary
        elements_with_ontology_concept_dict[xmi_id] = {
            'ontologyConceptArr': ontology_concept_arr,
            'begin': concept.attrib['begin'],
            'end': concept.attrib['end'],
            'polarity': concept.attrib['polarity'],
            'subject': concept.attrib['subject'],
            'historyOf': concept.attrib['historyOf']
        }

    # maps concept --> additional information!
    return elements_with_ontology_concept_dict


def get_umls_concepts_from_xmi(directory):
    """
    Given the path to a folder containing all .xmi files
    for a task evaluated on a specific model, this function
    will extract the UMLS concepts from the .xmi files
    and return them as a list of dictionaries.

    Args:
        directory (str): Path to the directory containing .xmi files.
    Returns:
        list: A list of dictionaries, each containing UMLS concept information.
            Empty if the directory does not exist.
    """
    # Check if the directory exists
    if not os.path.exists(directory):
        print(f"Directory {directory} does not exist.")
        return []

    # Create a list to store the extracted concepts
    umls_concepts_list = []

    # Iterate through all .xmi files in the directory
    for filename in os.listdir(directory):
        print(filename)
        if filename.endswith(".xmi"):
            file_path = os.path.join(directory, filename)

            # Parse the .xmi file
            tree = ET.parse(file_path)
            root = tree.getroot()

            # Extract all <refsem:UmlsConcept> elements
            namespaces = {
                "refsem": "http:///org/apache/ctakes/typesystem/type/refsem.ecore",
                "tcas": "http:///uima/tcas.ecore",
                "xmi": "http://www.omg.org/XMI"
            }
            umls_concepts = root.findall(".//refsem:UmlsConcept", namespaces)

            ontology_concept_dict = get_ontology_concepts(root)

            # return ontology_concept_dict


            for concept in umls_concepts:
                umls_concept = {
                    "xmi:id": concept.attrib.get("{http://www.omg.org/XMI}id"),
                    "codingScheme": concept.attrib.get("codingScheme"),
                    "code": concept.attrib.get("code"),
                    "preferredText": concept.attrib.get("preferredText"),
                    "cui": concept.attrib.get("cui"),
                    "tui": concept.attrib.get("tui")
                }

                concept_id = concept.attrib.get("{http://www.omg.org/XMI}id")

                # Attach span and context attributes from the annotation
                # whose ontologyConceptArr references this concept
                for key, d in ontology_concept_dict.items():
                    ontology_concept_arr = d["ontologyConceptArr"].split(" ")

                    if concept_id in ontology_concept_arr or concept_id == key:
                        umls_concept["begin"] = d["begin"]
                        umls_concept["end"] = d["end"]
                        umls_concept["polarity"] = d["polarity"]
                        umls_concept["subject"] = d["subject"]
                        umls_concept["historyOf"] = d["historyOf"]
                        umls_concept["ontologyConceptArr"] = d["ontologyConceptArr"]

                umls_concepts_list.append(umls_concept)

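        # NOTE: stops after the first directory entry (handy while testing);
        # remove this break to process every .xmi file in the directory.
        break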


    return umls_concepts_list
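
The function above scans every annotation for every concept. A possible alternative, sketched here under assumptions, is to invert `ontologyConceptArr` once so that each concept id maps directly to the annotation that references it. The helper name `index_annotations_by_concept` and the miniature XMI string are hypothetical; the attribute values are taken from the example attributes quoted in `get_ontology_concepts`.

```python
import xml.etree.ElementTree as ET


def index_annotations_by_concept(root):
    """Map each referenced concept xmi:id to the attributes of its annotation.

    If several annotations reference the same concept id, the last one wins.
    """
    index = {}
    for elem in root.iter():
        arr = elem.attrib.get("ontologyConceptArr")
        if not arr:
            continue
        for concept_id in arr.split():
            index[concept_id] = elem.attrib
    return index


# Hypothetical miniature XMI, only for illustration
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:refsem="http:///org/apache/ctakes/typesystem/type/refsem.ecore"
    xmlns:textsem="http:///org/apache/ctakes/typesystem/type/textsem.ecore">
  <textsem:MedicationMention xmi:id="1337" begin="318" end="321"
      ontologyConceptArr="1313 1323" polarity="1" subject="patient" historyOf="0"/>
  <refsem:UmlsConcept xmi:id="1313" cui="C0031381" tui="T109"/>
  <refsem:UmlsConcept xmi:id="1323" cui="C0031381" tui="T121"/>
</xmi:XMI>"""

index = index_annotations_by_concept(ET.fromstring(SAMPLE))
print(index["1313"]["begin"], index["1313"]["end"])
```

With this index, enriching a concept becomes a single dictionary lookup instead of a loop over all annotations.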


In [82]:
output = get_umls_concepts_from_xmi("/Users/kevinxie/Downloads/22.CLIP/Athene-V2-Chat")
output

6495_0.txt.xmi


[{'xmi:id': '1313',
  'codingScheme': 'SNOMEDCT_US',
  'code': '9721008',
  'preferredText': 'Phencyclidine',
  'cui': 'C0031381',
  'tui': 'T109',
  'begin': '318',
  'end': '321',
  'polarity': '1',
  'subject': 'patient',
  'historyOf': '0',
  'ontologyConceptArr': '1313 1323'},
 {'xmi:id': '1323',
  'codingScheme': 'SNOMEDCT_US',
  'code': '9721008',
  'preferredText': 'Phencyclidine',
  'cui': 'C0031381',
  'tui': 'T121',
  'begin': '318',
  'end': '321',
  'polarity': '1',
  'subject': 'patient',
  'historyOf': '0',
  'ontologyConceptArr': '1313 1323'},
 {'xmi:id': '1476',
  'codingScheme': 'SNOMEDCT_US',
  'code': '13924000',
  'preferredText': 'Injury wounds',
  'cui': 'C0043250',
  'tui': 'T037',
  'begin': '50',
  'end': '55',
  'polarity': '1',
  'subject': 'patient',
  'historyOf': '0',
  'ontologyConceptArr': '1476 1466'},
 {'xmi:id': '1466',
  'codingScheme': 'SNOMEDCT_US',
  'code': '416462003',
  'preferredText': 'Injury wounds',
  'cui': 'C0043250',
  'tui': 'T037',
  

In [78]:
with open('output.json', 'w') as f:
    json.dump(output, f, indent=4)

Each extracted record contains the following fields:

- `preferredText`: the preferred name of the concept identified by the CUI (NOT the exact text that appears in the response)
- `cui`: the UMLS Concept Unique Identifier (CUI) of the extracted concept
- `tui`: the UMLS semantic Type Unique Identifier (TUI) of the concept
- `xmi:id`: used to find additional information about the extracted concept

In other words, each record tells you that the CoT response mentions [preferredText], whose CUI is [cui] and whose TUI is [tui].

Using the [xmi:id], you can find additional information about the concept's (1) polarity, (2) subject, and (3) historyOf.
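
Because `preferredText` is not the exact text, it can help to recover the literal span from the `begin` and `end` offsets. The sketch below assumes that the analyzed text is stored in a `sofaString` attribute of the XMI file and that cTAKES encodes polarity as `1` (affirmed) or `-1` (negated); both should be checked against your own files. The names `get_document_text`, `describe_mention`, and the miniature XMI string are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: polarity is 1 for affirmed mentions and -1 for negated mentions
POLARITY_LABELS = {"1": "affirmed", "-1": "negated"}


def get_document_text(root):
    """Return the annotated text, assumed to live in a sofaString attribute."""
    for elem in root.iter():
        if "sofaString" in elem.attrib:
            return elem.attrib["sofaString"]
    return None


def describe_mention(document_text, concept):
    """Return the exact covered text and a readable polarity label for one record."""
    begin, end = int(concept["begin"]), int(concept["end"])
    return {
        "coveredText": document_text[begin:end],
        "polarity": POLARITY_LABELS.get(concept["polarity"], "unknown"),
    }


# Hypothetical miniature XMI, only for illustration
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaString="Patient denies PCP use."/>
</xmi:XMI>"""

text = get_document_text(ET.fromstring(SAMPLE))
record = {"preferredText": "Phencyclidine", "begin": "15", "end": "18", "polarity": "-1"}
mention = describe_mention(text, record)
print(mention)
```

One caveat: the offsets are produced by a Java tool, so responses containing characters outside the Basic Multilingual Plane (such as emoji) may not line up with Python string indices.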