# Generate Meta Data and Preprocess Data from the MIMIC-III Demo Dataset for Conquery

This tutorial shows how data and meta data tables from the [MIMIC-III Demo Dataset](https://physionet.org/content/mimiciii-demo/1.4/) can be used to prepare data structures
needed for conquery.

In detail, we will generate meta JSONs describing a table schema (Table-JSON), an import operation (Import-JSON, which closely resembles the corresponding Table-JSON and is used for the preprocessing) and a concept (Concept-JSON, which provides the query functionality) from the tables [DIAGNOSES_ICD.csv](https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv) and [D_ICD_DIAGNOSES.csv](https://physionet.org/content/mimiciii-demo/1.4/D_ICD_DIAGNOSES.csv).

Then we will use the Import-JSON to preprocess DIAGNOSES_ICD.csv into DIAGNOSES_ICD.cqpp (**c**on**q**uery **p**re**p**rocessed).

Finally, a dataset *MIMIC-III-Demo* will be created in an instance of conquery, and the Table-JSON, Concept-JSON and DIAGNOSES_ICD.cqpp will be uploaded to it.

In [37]:
## The imports for this notebook
import pandas as pd
import io
import requests as r
import json
import numpy as np
import os
import re
from enum import Enum, auto
from jsonschema import Draft7Validator, RefResolver
from pathlib import Path

# Define working directory
wd = Path(".")

## Helper Functions

In [66]:
class CQTypes(Enum):
    STRING = auto()
    INTEGER = auto()
    BOOLEAN = auto()
    REAL = auto()
    DECIMAL = auto()
    MONEY = auto()
    DATE = auto()
    DATE_RANGE = auto()

def get_csv_name(url):
    filename_matcher = re.compile(r"[\w\d_-]+\.csv")
    match = filename_matcher.search(url)
    if not match:
        raise ValueError(f"Unable to extract file name from {url}")
    return match.group(0)


def typeConverter(dtype) :
    # np.object_ replaces the alias np.object, which was removed in NumPy 1.24
    if np.issubdtype(dtype, np.object_) :
        return CQTypes.STRING.name
    if np.issubdtype(dtype, np.integer) :
        return CQTypes.INTEGER.name
    if np.issubdtype(dtype, np.bool_) :
        # Return the name (not the Enum member) so the result is JSON serializable
        return CQTypes.BOOLEAN.name
    if np.issubdtype(dtype, np.inexact) :
        return CQTypes.REAL.name
    # DECIMAL cannot be derived from the dtype because there is no analogue
    # MONEY cannot be derived from the dtype because it is a semantic rather than a logical type
    if np.issubdtype(dtype, np.datetime64):
        return CQTypes.DATE.name
    # DATE_RANGE not supported here yet
    raise ValueError(f"Encountered unhandled dtype: {dtype}")

def generate_table_column(name, dtype) :
    return {
        "name": name,
        "type" : typeConverter(dtype)
    }

def generate_table(name, df) :
    return {
        "name" : name,
        "columns": [ generate_table_column(name, dtype) for name, dtype in zip(df.dtypes.keys().array, df.dtypes.values)]
    }

def generate_import_column(name, dtype) :
    return {
        "inputColumn": name,
        "inputType": typeConverter(dtype),
        "name": name,
        "operation": "COPY"
    }

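def generate_import(df, primary_column, source_file) :
    # Every column except the primary one becomes an output column
    col_names = list(df.columns.values)
    col_names.remove(primary_column)
    # Use the df argument here, not the global data_df
    non_primary_df = df[col_names]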

    # Skip the filename suffix
    table_label = source_file.name.split(".")[0]

    return {
        "inputs": [
            {
                "output": [ generate_import_column(name, dtype) for name, dtype in zip(non_primary_df.dtypes.keys().array, non_primary_df.dtypes.values)],
                "primary": {
                    **generate_import_column(primary_column, df[[primary_column]].dtypes.values[0]),
                    "required": True,
                },
                "sourceFile": source_file.as_posix()
            }
        ],
        "table": table_label,
        "name": table_label
    }


def get_validator(base_schema_file):
    """
    Create a validator from a base schema in the directory "./json_schema"
    """
    schema_store = {}

    directory = wd / "json_schema"

    # Load every schema in the directory so that references between them can be resolved
    for file in directory.glob("*.json"):
        with open(file, "r") as schema_file:
            schema_store[file.name] = json.load(schema_file)

    # Resolve references relative to the requested base schema,
    # not relative to whichever file happened to be loaded last
    base_schema = schema_store[base_schema_file]
    resolver = RefResolver.from_schema(base_schema, store=schema_store)
    return Draft7Validator(base_schema, resolver=resolver)
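
The `get_validator` helper relies on schema files in `./json_schema` that are not shown in this notebook. As a rough, self-contained illustration of the validation step, the sketch below checks a column entry against a small inline schema. The schema `column_schema` is made up for this example and is not conquery's actual `table.json` schema.

```python
from jsonschema import Draft7Validator

# Hypothetical mini schema for a single column entry (assumption, not conquery's schema)
column_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "type": {"enum": ["STRING", "INTEGER", "BOOLEAN", "REAL", "DATE"]},
    },
    "required": ["name", "type"],
}

validator = Draft7Validator(column_schema)

# A complete column entry passes; validate() raises ValidationError otherwise
validator.validate({"name": "icd9_code", "type": "STRING"})

# iter_errors() collects all problems instead of raising on the first one
errors = [error.message for error in validator.iter_errors({"name": "icd9_code"})]
print(errors)
```

Collecting errors with `iter_errors` can be handy while developing the generator functions, because it reports every violation at once rather than stopping at the first.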

## Meta Data Creation
We will start with the creation of the meta data. For the Table-JSON and the Import-JSON we need the header of the data table (DIAGNOSES_ICD.csv) that we want to use later in conquery.
This process is rather generic, as it usually just annotates the columns with type information.
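
To make the idea of annotating columns with type information concrete, here is a minimal sketch on a tiny hand-made DataFrame. The function `column_type`, the table name `SAMPLE` and the sample values are assumptions for illustration only. The sketch uses `pandas.api.types` checks rather than the `np.issubdtype` calls of the `typeConverter` helper, and it covers only a subset of the types.

```python
import pandas as pd
from pandas.api import types as pdt

def column_type(dtype):
    """Map a pandas dtype to a type name (illustrative subset, hypothetical helper)."""
    if pdt.is_bool_dtype(dtype):
        return "BOOLEAN"
    if pdt.is_integer_dtype(dtype):
        return "INTEGER"
    if pdt.is_float_dtype(dtype):
        return "REAL"
    if pdt.is_datetime64_any_dtype(dtype):
        return "DATE"
    # Everything else is treated as a string in this sketch
    return "STRING"

# Made-up rows that only mimic the column layout of DIAGNOSES_ICD.csv
sample_df = pd.DataFrame({
    "subject_id": ["1", "2"],
    "seq_num": [1, 2],
    "icd9_code": ["0389", "4019"],
})

# One entry per column: the column name plus its derived type
sample_table = {
    "name": "SAMPLE",
    "columns": [
        {"name": name, "type": column_type(dtype)}
        for name, dtype in sample_df.dtypes.items()
    ],
}
print(sample_table)
```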

For the Concept-JSON we will use the meta data table (D_ICD_DIAGNOSES.csv) to create a tree-structured concept from the hierarchical *icd9_code*.
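
The following sketch illustrates how a flat list of codes could be turned into a two-level tree. It assumes that the first three characters of a code identify its parent category, which is a simplification, and the keys `name` and `children` as well as the function `build_code_tree` are hypothetical. It does not reproduce the actual Concept-JSON format.

```python
from collections import defaultdict

def build_code_tree(codes, prefix_len=3):
    """Group codes under their leading characters (hypothetical helper)."""
    groups = defaultdict(list)
    # Sorting first keeps both the groups and their members in a stable order
    for code in sorted(set(codes)):
        groups[code[:prefix_len]].append(code)
    return [
        {
            "name": prefix,
            # A code equal to its own prefix is the parent itself, not a child
            "children": [{"name": code} for code in members if code != prefix],
        }
        for prefix, members in groups.items()
    ]

# Made-up example codes without decimal points
tree = build_code_tree(["0389", "0380", "4019", "401"])
print(tree)
```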

### Download Data Table

In [39]:
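data_url = "https://physionet.org/files/mimiciii-demo/1.4/DIAGNOSES_ICD.csv?download"
response = r.get(data_url)
response.raise_for_status()  # fail early if the download did not succeed
s = response.content
# Read identifier and code columns as strings so their values are kept verbatim
data_df = pd.read_csv(io.StringIO(s.decode('utf-8')), index_col="row_id", dtype={"subject_id": str, "hadm_id": str, "icd9_code": str })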


# Write out the csv because it is needed for the preprocessing
data_file = wd / "data" / "csv" / get_csv_name(data_url)
data_file.parent.mkdir(parents=True, exist_ok=True)

data_df.to_csv(data_file)
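
The `dtype` argument in the cell above reads the identifier and code columns as strings. A likely reason, stated here as an assumption, is that codes can start with zeros, which pandas would drop if it inferred an integer column. The small sketch below shows the difference on a made-up one-column CSV.

```python
import io
import pandas as pd

# Made-up CSV content with a code that has a leading zero
csv_text = "icd9_code\n0389\n4019\n"

# Without a dtype hint pandas infers an integer column and the leading zero is lost
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str the values are kept exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"icd9_code": str})

print(inferred["icd9_code"].tolist())
print(as_text["icd9_code"].tolist())
```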

### Generate Table-JSON and Validate

In [60]:
table_name = data_file.name.split(".")[0]
table = generate_table(table_name, data_df)

get_validator("table.json").validate(table)

table_json_file = wd / "data" / "tables" / f"{table_name}.table.json"
table_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(table_json_file, "w") as f:
    json.dump(table, f, indent="\t")

### Generate Import-JSON and Validate

In [67]:
table_name = data_file.name.split(".")[0]
import_ = generate_import(data_df, "subject_id", data_file)

get_validator("import.json").validate(import_)

import_json_file = wd / "data" / "imports" / f"{table_name}.import.json"
import_json_file.parent.mkdir(parents=True, exist_ok=True)

with open(import_json_file, "w") as f:
    json.dump(import_, f, indent="\t")