# Prepare ICD9 Data from the MIMIC-III Demo Dataset for Conquery

This tutorial shows how data and meta data tables from the [MIMIC-III Demo Dataset](https://physionet.org/content/mimiciii-demo/1.4/) can be used to prepare data structures
needed for conquery.

In detail, we will generate three meta JSONs from the table [DIAGNOSES_ICD.csv](https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv): a Table-JSON describing the table schema, an Import-JSON describing the import operation (it closely resembles the corresponding Table-JSON and is used for the preprocessing), and a Concept-JSON describing a concept, which provides the query functionality.

In [None]:
## The imports for this notebook
import pandas as pd
import io
import requests as r
import os
import json
from pathlib import Path
from zipfile import ZipFile
from io import BytesIO
import lib.conquery_util as cq
import re

# Define working directory
wd = Path(".")

## Meta Data Creation
We start by creating the meta data. For the Table-JSON and the Import-JSON we need the header of the data table (DIAGNOSES_ICD.csv) that we want to use later in conquery.
This process is rather generic, as it is usually just an annotation of the columns with type information.

For the Concept-JSON we will use the official ICD-9 catalog to create a tree-structured concept from the hierarchical *icd9_code*.
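
The column annotation behind the Table-JSON can be pictured with a small, self-contained sketch. It is not the implementation of `cq.generate_table`: the helper names (`infer_type`, `sketch_table`), the JSON keys (`primaryColumn`, `columns`) and the type names (`STRING`, `INTEGER`, `REAL`) are assumptions made for illustration, and the sample rows are made up.

```python
import csv
import io

# Made-up sample rows that mimic the header of DIAGNOSES_ICD.csv
SAMPLE_CSV = """row_id,subject_id,hadm_id,seq_num,icd9_code
1,100,2000,1,0389
2,100,2000,2,4019
"""


def infer_type(values):
    """Guess a coarse type name (assumed names) from raw CSV strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    for caster, type_name in ((int, "INTEGER"), (float, "REAL")):
        try:
            for value in non_empty:
                caster(value)
        except ValueError:
            # At least one value does not fit this type, try the next one
            continue
        else:
            return type_name
    return "STRING"


def sketch_table(name, csv_text, primary_column, forced_types=None):
    """Annotate every CSV column with a type and split off the primary column."""
    forced_types = forced_types or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    types = {
        column: forced_types.get(column) or infer_type([row[column] for row in rows])
        for column in (reader.fieldnames or [])
    }
    return {
        "name": name,
        "primaryColumn": {"name": primary_column, "type": types[primary_column]},
        "columns": [
            {"name": column, "type": column_type}
            for column, column_type in types.items()
            if column != primary_column
        ],
    }


# Identifier-like columns are forced to STRING, as the notebook does with its `dtype` argument
table_sketch = sketch_table(
    "DIAGNOSES_ICD",
    SAMPLE_CSV,
    "subject_id",
    forced_types={"subject_id": "STRING", "hadm_id": "STRING", "icd9_code": "STRING"},
)
```

Without `forced_types`, `infer_type` would classify `icd9_code` as `INTEGER`, and a code such as `0389` would lose its leading zero once parsed as a number. This is why the notebook reads the identifier columns as strings.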

### Download Data Table

In [None]:
data_url = "https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv?download"
s = r.get(data_url).content
data_df = pd.read_csv(io.StringIO(s.decode('utf-8')), index_col="row_id", dtype={"subject_id": str, "hadm_id": str, "icd9_code": str})


# Write out the csv because it is needed for the preprocessing
data_file = wd / "data" / "csv" / cq.get_csv_name(data_url)
data_file.parent.mkdir(parents=True, exist_ok=True)

data_df.to_csv(data_file)

### Generate Table-JSON and Validate

In [None]:
table_name = data_file.name.split(".")[0]
table = cq.generate_table(table_name, data_df, "subject_id")

cq.get_validator(wd/"json_schema"/"table.json").validate(table)

table_json_file = wd / "data" / "tables" / f"{table_name}.table.json"
table_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(table_json_file, "w") as f:
    json.dump(table, f, indent="\t")

### Generate Import-JSON and Validate

In [None]:
table_name = data_file.name.split(".")[0]
import_ = cq.generate_import(data_df, "subject_id", data_file)

cq.get_validator(wd/"json_schema"/"import.json").validate(import_)

import_json_file = wd / "data" / "imports" / f"{table_name}.import.json"
import_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(import_json_file, "w") as f:
    json.dump(import_, f, indent="\t")

### Generate Concept-JSON

In this section we generate an ICD concept based on the official ICD-9 catalog (https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD9-CM/2011/Dtab12.zip).
We could have used the meta data provided by MIMIC-III (https://physionet.org/files/mimiciii-demo/1.4/D_ICD_DIAGNOSES.csv), but it lacks structural and descriptive information (such as chapters, names of higher hierarchy levels and additional infos).

In [None]:
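# Optional: load the code dictionary shipped with MIMIC-III for comparison.
# It is not used further in this notebook, because the concept is built from the official catalog.
meta_url = "https://physionet.org/files/mimiciii-demo/1.4/D_ICD_DIAGNOSES.csv?download"
s = r.get(meta_url).content
meta_df = pd.read_csv(io.StringIO(s.decode('utf-8')), index_col="row_id", dtype={"icd9_code": str})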


Download and extract the ICD-9 catalog. 

In [None]:
meta_dir = wd / "data" / "meta"
meta_dir.mkdir(exist_ok=True, parents=True)

meta_url = "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD9-CM/2011/Dtab12.zip"
with ZipFile(BytesIO(r.get(meta_url, stream=True).content)) as zip_ref:
    zip_ref.extractall(meta_dir)

Read the catalog, which is provided in RTF, and convert it to plain text. This takes several minutes!
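
Since the conversion is slow, it can be convenient to cache the plain text so that later runs skip it. The following is only a sketch: `cached_text` and the idea of a cache file are hypothetical and not part of the original notebook, and the converter is passed in as a parameter so that the helper does not depend on `striprtf` itself.

```python
from pathlib import Path


def cached_text(source: Path, cache: Path, convert) -> str:
    """Return the converted text of `source`, reusing `cache` if it is up to date."""
    # Reuse the cache only if it is at least as new as the source file
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        return cache.read_text(encoding="utf-8")
    # Otherwise run the (slow) conversion once and store the result
    text = convert(source.read_text())
    cache.write_text(text, encoding="utf-8")
    return text
```

In this notebook the call could look like `cached_text(icd_file, icd_file.with_suffix(".txt"), rtf_to_text)` once `icd_file` is defined.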

In [None]:
from striprtf.striprtf import rtf_to_text

icd_file = meta_dir / "Dtab12.rtf"

with open(icd_file, "r") as f:
    icd_catalog = rtf_to_text(f.read())

Parse the catalog into a hierarchical structure that suits conquery as a concept.  

We define the `concept` foundation and then add the tree of children to it.
Because the catalog is already sorted hierarchically, a single state variable per level (`in_chapter`, `in_section`, ...) is enough to keep track of where we currently are when we insert new nodes into the tree.
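
The single-pass idea can be sketched on a toy input. The function `build_tree` and its `(level, code, label)` input format are hypothetical simplifications: the actual parsing cell uses one named variable per level and derives the level from regular expressions, and its two uppermost levels start with an empty list of values.

```python
def build_tree(entries, depth=3):
    """Build a tree from sorted (level, code, label) tuples; level 0 is the top level."""
    root = {"children": []}
    # One slot per level, mirroring in_chapter, in_section, ...
    current = [None] * depth
    for level, code, label in entries:
        node = {
            "label": label,
            "condition": {"type": "EQUAL", "values": [code]},
            "children": [],
        }
        # Entering a node closes every deeper level
        current[level] = node
        for deeper in range(level + 1, depth):
            current[deeper] = None
        # Open ancestors, nearest first
        ancestors = [n for n in reversed(current[:level]) if n is not None]
        parent = ancestors[0] if ancestors else root
        parent["children"].append(node)
        # Propagate the code to the EQUAL condition of every open ancestor
        for ancestor in ancestors:
            ancestor["condition"]["values"].append(code)
    return root


# Made-up, already sorted entries
toy_entries = [
    (0, "001", "Cholera"),
    (1, "0010", "Due to Vibrio cholerae"),
    (1, "0011", "Due to Vibrio cholerae el tor"),
    (0, "002", "Typhoid and paratyphoid fevers"),
    (1, "0020", "Typhoid fever"),
]
toy_tree = build_tree(toy_entries)
```

Because the input is sorted, no lookup of the parent is needed: the most recently seen node of the next higher level is always the right one.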

An ICD-9 code can have up to 5 levels. The first two levels are specified by a range into which a fixed-length prefix might fall. All lower levels are determined by a single fixed-length prefix. Fortunately, we can tell the level of a catalog line by matching it against a distinct regular expression per level (`chapter_matcher`, `section_matcher`, ...).
Lines that match none of them are treated as additional infos and are appended to the most recently parsed element, `in_recent`.
If present, these additional infos are attached to the node and displayed in the left column of the frontend. For ICD-9 codes, they usually indicate when a code applies, by means of *Excludes* and *Includes* sections.
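
To see how the matchers separate the levels, the following sketch runs them over a few hand-written lines. The patterns are copied from the parsing cell so that the sketch runs on its own; the helper `classify` and the sample lines are illustrative assumptions, and the real output of `rtf_to_text` may differ in whitespace.

```python
import re

# Same patterns as in the parsing cell, repeated here for a self-contained example
chapter_matcher = re.compile(r"^(?P<chapter>\d+\.)\s*(?P<name>[A-Z, -]+)\((?P<start>\d{3})-(?P<end>\d{3})\)$")
section_matcher = re.compile(r"^(?P<name>[A-Z, -]+)\((?P<start>\d{3})-(?P<end>\d{3})\)$")
subsection_matcher = re.compile(r"^(?P<prefix>[\d]{3})\s+(?P<name>[\w\d ()\-,\[\]\.]+)")
subsubsection_matcher = re.compile(r"^(?P<prefix>[\d]{3}\.\d)\s+(?P<name>[\w\d ()\-,\[\]\.]+)")
subsubsubsection_matcher = re.compile(r"^(?P<prefix>[\d]{3}\.\d{2})\s+(?P<name>[\w\d ()\-,\[\]\.]+)")

LEVELS = [
    ("chapter", chapter_matcher),
    ("section", section_matcher),
    ("subsection", subsection_matcher),
    ("subsubsection", subsubsection_matcher),
    ("subsubsubsection", subsubsubsection_matcher),
]


def classify(line):
    """Return (level, named groups) of the first matcher that accepts the line."""
    for level, matcher in LEVELS:
        match = matcher.match(line)
        if match:
            return level, match.groupdict()
    # Lines that match no level are treated as additional infos
    return "additional_info", {}


# Hand-written sample lines (assumed format)
SAMPLE_LINES = [
    "1. INFECTIOUS AND PARASITIC DISEASES (001-139)",
    "INTESTINAL INFECTIOUS DISEASES (001-009)",
    "001\tCholera",
    "001.0\tDue to Vibrio cholerae",
    "003.20\tLocalized salmonella infection, unspecified",
    "Excludes:\tsome other condition",
]

for line in SAMPLE_LINES:
    level, groups = classify(line)
    # Codes in the data carry no dot, so it is stripped from the catalog prefix
    code = groups.get("prefix", "").replace(".", "")
    print(f"{level:<18} code={code!r:<8} {line!r}")
```

The three prefix patterns do not overlap because each one requires whitespace directly after its prefix: `001.0` cannot match the three-digit pattern, since a dot follows the digits instead of whitespace.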

In [None]:
# These matchers are used to scrape information for each level from the parsed icd catalog
chapter_matcher = re.compile(r"^(?P<chapter>\d+\.)\s*(?P<name>[A-Z, -]+)\((?P<start>\d{3})-(?P<end>\d{3})\)$")
section_matcher = re.compile(r"^(?P<name>[A-Z, -]+)\((?P<start>\d{3})-(?P<end>\d{3})\)$")
subsection_matcher = re.compile(r"^(?P<prefix>[\d]{3})\s+(?P<name>[\w\d ()\-,\[\]\.]+)")
subsubsection_matcher = re.compile(r"^(?P<prefix>[\d]{3}\.\d)\s+(?P<name>[\w\d ()\-,\[\]\.]+)")
subsubsubsection_matcher = re.compile(r"^(?P<prefix>[\d]{3}\.\d{2})\s+(?P<name>[\w\d ()\-,\[\]\.]+)")

# The concept forms the root of the hierarchy
concept = {
    # Placeholder for the rest of the tree
    "children": [],
    # Defines which columns represent codes for this concept
    "connectors": [{
        "column": "DIAGNOSES_ICD.icd9_code",
        "label": "Diagnoses"
    }],
    # The display name
    "label" : "ICD",
    # The internal name that is used to create an id (must be unique in a dataset among concepts, tables and secondaryIds)
    "name" : "icd",
    # Selects define aggregations that create additional columns in the output
    "selects": [],
    # At the moment there is just this type TREE
    "type": "TREE"
}

in_chapter = None
in_section = None
in_subsection = None
in_subsubsection = None
in_subsubsubsection = None
in_recent = None
additional_info_key = ""
for line in icd_catalog.split("\n"):

    # Chapter
    match = chapter_matcher.match(line)
    if match:
        in_chapter = {
            # The label will be displayed in the concept overview, the query editor and the query result
            "label": f"{match.group('chapter')}",
            # The description is displayed in the concept overview, the query editor
            "description": match.group("name").title(),
            "condition": {
                # Defines which codes fall into this chapter/section
                "type": "EQUAL",
                "values": []
            },
            # Placeholder for sections, the next level in the tree
            "children": [],
            # Placeholder for additional infos (notes such as Includes/Excludes)
            "additionalInfos": [],
        }
        in_recent = in_chapter
        # Reset sub-levels
        in_section = None
        in_subsection = None
        in_subsubsection = None
        in_subsubsubsection = None
        additional_info_key = ""
        
        # insert node
        concept["children"].append(in_chapter)
        
        continue

    # Section
    match = section_matcher.match(line)
    if match:
        in_section = {
            "label": f"{match.group('start')}-{match.group('end')}",
            "description": match.group("name").title(),
            "condition": {
                "type": "EQUAL",
                "values": []
            },
            "children": [],
            "additionalInfos": [],
        }
        
        in_recent = in_section
        
        # Reset sub-levels
        in_subsection = None
        in_subsubsection = None
        in_subsubsubsection = None
        additional_info_key = ""
        
        # insert node
        in_chapter["children"].append(in_section)
        
        continue

    # Subsection
    match = subsection_matcher.match(line)
    if match:
        
        code = match.group("prefix")
        
        in_subsection = {
            "label": match.group("prefix"),
            "description": match.group("name"),
            "condition": {
                "type": "EQUAL",
                "values": [code]
            },
            "children": [],
            "additionalInfos": [],
        }
        
        in_recent = in_subsection
        
        # Reset sub-levels
        in_subsubsection = None
        in_subsubsubsection = None
        additional_info_key = ""

        # insert node
        upper_level = in_section or in_chapter
        upper_level["children"].append(in_subsection)
        
        # propagate matching code to EQUAL-condition of upper levels
        for parent in [in_section, in_chapter]:
            if parent:
                parent["condition"]["values"].append(code)
        continue

    # Subsubsection
    match = subsubsection_matcher.match(line)
    if match:
        # the descriptive codes differ from the codes in the data in that they contain a dot that we strip
        code = match.group("prefix").replace(".","")
        
        in_subsubsection = {
            "label": match.group("prefix"),
            "description": match.group("name"),
            "condition": {
                "type": "EQUAL",
                "values": [code]
            },
            "children": [],
            "additionalInfos": [],
        }
        in_recent = in_subsubsection
        
        # Reset sub-levels
        in_subsubsubsection = None
        additional_info_key = ""

        # insert node
        upper_level = in_subsection or in_section or in_chapter
        upper_level["children"].append(in_subsubsection)
        
        # propagate matching code to EQUAL-condition of upper levels
        for parent in [in_subsection, in_section, in_chapter]:
            if parent:
                parent["condition"]["values"].append(code)
                
        continue

    # Subsubsubsection (lowest-level)
    match = subsubsubsection_matcher.match(line)
    if match:
        # the descriptive codes differ from the codes in the data in that they contain a dot that we strip
        code = match.group("prefix").replace(".","")
        
        in_subsubsubsection = {
            "label": match.group("prefix"),
            "description": match.group("name"),
            "condition": {
                "type": "EQUAL",
                "values": [code]
            },
            "children": [],
            "additionalInfos": [],
        }
        in_recent = in_subsubsubsection
        
        # Reset sub-levels
        additional_info_key = ""

        # insert node
        upper_level = in_subsubsection or in_subsection or in_section or in_chapter
        upper_level["children"].append(in_subsubsubsection)
        
        # propagate matching code to EQUAL-condition of upper levels
        for parent in [in_subsubsection, in_subsection, in_section, in_chapter]:
            if parent:
                parent["condition"]["values"].append(code)
        
        continue

    # Additional Infos
    if not in_recent:
        continue
    value = line
    if line.startswith('Excludes:'):
        additional_info_key = "Excludes:"
        # +1 to also skip the tab (\t) that follows the key
        value = value[len(additional_info_key)+1:]
    elif line.startswith('Includes:'):
        additional_info_key = "Includes:"
        value = value[len(additional_info_key)+1:]

    additional_info = in_recent["additionalInfos"]

    # Should be one item at max
    items = list(filter(lambda i: i["key"] == additional_info_key, additional_info))

    if len(items) > 1:
        raise RuntimeError(f"Expected key {additional_info_key} to appear at most once")
    if len(items) == 0:
        # First time this key appeared
        additional_info.append(
            {
                "key": additional_info_key,
                "value": value
            }
        )
        continue

    items[0]["value"] += f"\n{value}"
    

Validate and write the concept.
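
Schema validation presumably covers only the shape of the JSON. As an optional extra, the propagation performed during parsing can be checked directly: every code that a child matches should also appear in the `EQUAL` condition of its parent. The helper `find_unpropagated` is hypothetical and not part of `lib.conquery_util`, and the two toy trees are made up.

```python
def find_unpropagated(node, path=()):
    """Yield (path, code) for every code a child matches but its parent does not."""
    # The root concept has no condition, so it is skipped as a parent
    own = node.get("condition", {}).get("values")
    for child in node.get("children", []):
        child_path = path + (child.get("label", "?"),)
        child_values = child.get("condition", {}).get("values", [])
        if own is not None:
            for code in sorted(set(child_values) - set(own)):
                yield child_path, code
        # Descend into the next level
        yield from find_unpropagated(child, child_path)


def leaf(label, code):
    """Build a made-up leaf node in the same shape as the parsed nodes."""
    return {"label": label, "condition": {"type": "EQUAL", "values": [code]}, "children": []}


good_tree = {"children": [{
    "label": "001",
    "condition": {"type": "EQUAL", "values": ["001", "0010"]},
    "children": [leaf("001.0", "0010")],
}]}

bad_tree = {"children": [{
    "label": "001",
    "condition": {"type": "EQUAL", "values": ["001"]},
    "children": [leaf("001.0", "0010")],
}]}

good_problems = list(find_unpropagated(good_tree))
bad_problems = list(find_unpropagated(bad_tree))
```

Applied to the parsed catalog, `list(find_unpropagated(concept))` would be expected to return an empty list.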

In [None]:
cq.get_validator(wd / "json_schema" / "concept.json").validate(concept)

concept_json_file = wd / "data" / "concepts" / "icd.concept.json"
concept_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(concept_json_file, "w") as f:
    json.dump(concept, f, indent="\t")

## Preprocessing and Upload

The next tutorial, [Preprocess and Upload](./preprocess_and_upload.ipynb), shows how to preprocess and upload all data and meta data produced by this notebook.
